Abstract 


For nearly a century Ludwig Prandtl’s lifting-line theory has remained a standard tool for understanding and 
analyzing aircraft wings. The tool, Prandtl initially said, points to the elliptical spanload as the most efficient 
wing choice, and it, too, has become the standard in aviation. 

Having no other model, avian researchers have used the elliptical spanload virtually since its 
introduction. Yet over the last half-century, research in bird flight has generated increasing data 
incongruous with the elliptical spanload. 

In 1933 Prandtl published a little-known paper presenting a superior spanload: any other solution 
produces greater drag. We argue that this second spanload is the correct model for bird flight data. Based 
on our research, we present a unifying theory for superior efficiency and coordinated control in a single solution. 
Specifically, Prandtl’s second spanload offers the only solution to three aspects of bird flight: how birds are 
able to turn and maneuver without a vertical tail; why birds fly in formation with their wingtips overlapped; 
and why narrow wingtips do not result in wingtip stall. 

We performed research using two experimental aircraft designed in accordance with the fundamentals 
of Prandtl’s second paper, but applying recent developments, to validate the various potentials of the new 
spanload, to wit: to provide an alternative for avian researchers, to demonstrate the concept of proverse yaw, and to 
offer a new method of aircraft control and efficiency. 


Introduction 


In 1922 Ludwig Prandtl published his “lifting line” theory in English; the tool enabled the calculation 
of lift and drag for a given wing. Using this tool results in the optimum spanload for minimum induced drag 
(the greatest efficiency) for a given span, which, Prandtl said, was elliptical (ref. 1). Since then, the lifting 
line theory and elliptical spanload have become the standard design tool and wing spanloading in aviation. 
So ubiquitous is it that avian researchers have relied on it to explain bird flight data almost since its 
introduction. But in 1933 Prandtl published a second paper on the subject in which he conceded that his 
first conclusion was incomplete: there was a superior spanload solution to maximum efficiency for a given 
structural weight. “That the wingspan has to be specified,” he wrote, “leads to the invalid assertion that the 
elliptical distribution is best” (ref. 2). His new bell-shaped spanload creates a wing that is 11 percent more 
efficient and has 22 percent greater span than its elliptically-loaded cousin, all while using exactly the same 
amount of structure. It results in the minimum drag solution in every case of physical wings: any other 
solution will produce greater drag. Oddly, Prandtl’s second spanload remains virtually unknown. 

Sometime around 1935 Reimar Horten independently derived an approximate equivalent to Prandtl’s 
1933 solution. Horten dubbed it “bell shaped” for its wing loading. The extant evidence shows sufficient 
differences between the two men’s methods, objectives, and conclusions to exclude any mingling of 
information on this subject, even though the two were contemporaries. While Prandtl calculated the total induced drag 
for a wing with this new spanload, he did not examine the distribution of the induced drag across the span, 
and so he missed its implications. Horten, on the other hand, did calculate the induced drag across the span 
of the wing, and in 1950 concluded that something singularly possible existed with such a spanload, 
although he never conclusively proved it (refs. 3, 4). What Prandtl missed and Horten believed existed with 
respect to the alternate spanloading (the bell) is proverse yaw. Figure 1 shows the elliptical and bell 
spanloads of Ludwig Prandtl. 
Figure 1. The elliptical and bell spanloads of Ludwig Prandtl. 


Figure 1(a) shows Prandtl’s elliptical spanload from 1920 and the bell spanload from 1933. The symbol 
gamma (Γ) signifies the airflow circulation about the wing. Figure 1(b) shows the matching downwash 
(dw) of the elliptical spanload (1920) and the downwash of the bell spanload (1933). In figure 1(c) the 
upwash outboard of the wingtip is shown. Figure 1(d) shows the 1920 Prandtl elliptical spanload downwash 
and upwash (note the sharp discontinuity at the wingtip, which is the wingtip vortex). Figure 1(e) shows 
the 1933 Prandtl spanload downwash and upwash (in contrast to the 1920 solution, note the smooth, 
continuous upwash across the wing and beyond; the wing vortex is now inboard of the tips). A comparison 
of the flow fields resulting from the elliptical and bell spanloads is shown in figures 1(d) and 1(e). The 
elliptical spanload wing, figure 1(d), has a sharp discontinuous slope at the wingtip span location in the 
upwash (this is the location of the wingtip vortex), in contrast to the smooth curve of the new upwash, 
figure 1(e) with no discontinuity (a weak vortex forms at the point where the downwash crosses the zero 
line and becomes upwash). 

Prandtl’s 1933 solution is stated as 

$$L = \left(1 - x^{2}\right)^{3/2}$$

where L is the nondimensional local load (this is also expressed as gamma or Γ); and x is the span location 
between 0 and 1. Subsequently, 

$$DW = \frac{3}{2}\left(x^{2} - \frac{1}{2}\right)$$

where DW is the nondimensional downwash (angle) of the flow. 
The lift approaches zero at the wingtip, as shown in equation (1): 

$$\lim_{x:\,0 \to b/2} L(x) = 0 \tag{1}$$

The slope of the lift (as a function of span) approaches zero at the wingtip, as shown in equation (2): 

$$\lim_{x:\,0 \to b/2} \frac{dL(x)}{dx} = 0 \tag{2}$$

The slope of the upwash (as a function of span) at the wingtip is equal on both sides of the wingtip, as 
shown in equation (3): 

$$\lim_{x:\,0 \to b/2} \frac{dDW(x)}{dx} = \lim_{x:\,\infty \to b/2} \frac{dDW(x)}{dx} \tag{3}$$
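
The spanload and downwash expressions above can be checked numerically. The following is a minimal sketch, assuming the forms L(x) = (1 - x^2)^(3/2) and DW(x) = (3/2)(x^2 - 1/2) with x the fraction of semispan, and assuming that negative DW denotes downwash and positive DW denotes upwash. The function names are illustrative and are not taken from the original analysis.

```python
import math


def bell_load(x):
    """Nondimensional bell spanload L(x) = (1 - x^2)^(3/2) for 0 <= x <= 1."""
    return (1.0 - x * x) ** 1.5


def bell_load_slope(x):
    """Analytic derivative dL/dx = -3 x sqrt(1 - x^2)."""
    return -3.0 * x * math.sqrt(1.0 - x * x)


def bell_downwash(x):
    """Nondimensional downwash DW(x) = (3/2)(x^2 - 1/2).

    Assumed sign convention: negative is downwash, positive is upwash.
    """
    return 1.5 * (x * x - 0.5)


# Station where DW(x) = 0, that is, where downwash becomes upwash.
X_CROSS = math.sqrt(0.5)

for x in (0.0, 0.25, 0.5, X_CROSS, 0.9, 1.0):
    print(f"x={x:.4f}  L={bell_load(x):.4f}  "
          f"dL/dx={bell_load_slope(x):+.4f}  DW={bell_downwash(x):+.4f}")
```

Under these assumptions the load and its slope both vanish at x = 1, and DW changes sign at x = sqrt(1/2), about 0.707, which is close to the 0.704 station quoted in the text.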


Induced Drag, and Adverse and Proverse Yaw 


It is critical to understand the airflow and forces exerted on a wing during flight, including lift and 
induced drag, to appreciate the differences between the elliptical and bell spanloads, and the implications 
for birds and aircraft. 

Ludwig Prandtl described both the elliptical (1920) and bell (1933) spanload distributions as shown in 
figure 2. The 1920 elliptical spanload, figure 2(a), describes a wing with a uniform downwash along the 
wing’s trailing edge, and a sharp discontinuity of downwash and upwash at the wingtip, which results in a 
strong, tightly-rolled vortex formed at the wingtip. In contrast, the bell spanload describes a wing having a 
downwash that varies from strong downwash near the wing root, which tapers outboard 
(past b/2 = 0.704), to upwash near the wingtip. The bell spanload is also much more heavily loaded 
(more net force) in the root area of which the large root downwash is a consequence. The significance of 
these disparate characteristics is both subtle and dramatic. 
Figure 2. Prandtl’s elliptical and bell spanloads explained. 


Figure 2(a) shows Prandtl’s elliptical spanload from 1920; figure 2(b) shows the bell spanload from 1933. The 
symbol gamma (Γ) signifies the airflow circulation about the wing. The matching downwash (dw) for each 
spanload is also shown. The upwash on the 1920 Prandtl 
elliptical spanload is outboard of the wingtip. Of importance in the elliptical spanload shown in figure 2(a) 
is that the net force vector field is tilted backwards by the constant downwash along the entire span of the 
wing. The horizontal component of the resultant force (Γ) manifests itself as induced drag across 
the entire wingspan. By contrast, figure 2(b) shows that the downwash of the 1933 Prandtl bell spanload 
varies along the span, crossing the zero line and becoming upwash near the wingtip. There the 
resultant force is tilted forward of the vertical and its horizontal component is manifested as induced thrust 
at the wingtip, due to the upwash. 

Airflow over a wing generates a net force, which is approximately normal to the wing chord. As shown 
in figure 2(a) for an elliptic spanload, this resultant force vector is not exactly perpendicular to the airflow. 
The larger component perpendicular to the relative wind is known as lift. For finite wings there exists a 
component parallel to the relative wind (for elliptical spanload, always in the direction with the wind, that 
is, toward the trailing edge) that is referred to as induced drag. Induced drag is the “cost” of producing lift 
with a finite wing. As lift increases, induced drag also increases. Thus, any control surface deflected to 
locally produce more lift will also locally produce more drag. Ailerons deflected anti-symmetrically to 
generate a rolling moment will also produce a yawing moment to the outside, or against the turn being 
generated by the roll. This phenomenon is referred to as adverse yaw and is the reason all aircraft with an 
elliptical spanload require an auxiliary yaw device (typically a rudder, courtesy of the Wright brothers in 
1902 [refs. 5, 6]) to counter the adverse yaw in order to coordinate the turn (yaw with the turn). 

For the bell spanload, shown in figure 2(b), the net force vector is such that it varies along the span. 
Inboard, the force vector is tilted away from the relative wind, like that of the elliptical spanload case, and 
the parallel component produces induced drag. Progressing outboard, this parallel component reduces in 
magnitude until it eventually (past b/2 = 0.704) is tilted into the relative wind. This phenomenon is referred 
to as induced thrust (that is, negative induced drag). It should be noted that the sum total force of this parallel 
component still produces a net drag (and this sum total is more than that of an elliptical spanload of the 
same span; in our case we are able to increase the span and achieve less total induced drag), but locally, for 
the outer 0.296 of the semispan, it produces thrust. A control surface placed in this local thrust region will generate 
increasing thrust with increasing lift. Thus an aileron located in this region will produce a yawing moment 
into the turn, which moment is referred to as a proverse yawing moment. A properly designed aileron, on 
a bell spanload wing, could produce just the right amount of proverse yaw such that an auxiliary yaw device 
would be entirely unnecessary for coordinated turning flight. A design without an auxiliary yaw device 
means there would be neither added drag nor complexity from such a device. This does not mean that 
no aircraft design employing a bell spanload would ever need an auxiliary yaw device. 
There are instances on modern aircraft designs (for example, engine-out or crosswind landing) in 
which such a device would still be needed, but there could be designs for which such a device would not 
be required. Various bird species appear to maneuver gracefully with very minimal auxiliary yaw devices, 
and some seemingly have no discernible auxiliary yaw device. Further, as Prandtl pointed out in 1933, it is 
possible to extend the wingspan of a bell spanload, achieve the same lift and the same integrated wing 
bending moment, and achieve less induced drag than the equivalent elliptical spanload. This final solution 
is examined here. 
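
The split between inboard induced drag and outboard induced thrust described above can be illustrated numerically. This is a minimal sketch, assuming (in the small-angle lifting-line sense) that the local induced drag per unit span is proportional to the local load multiplied by the local downwash angle, with the nondimensional forms L(x) = (1 - x^2)^(3/2) and DW(x) = (3/2)(x^2 - 1/2) and negative DW taken as downwash. The helper names and the midpoint integration are illustrative choices, not the authors' method.

```python
import math


def bell_load(x):
    """Nondimensional bell spanload."""
    return (1.0 - x * x) ** 1.5


def bell_downwash(x):
    """Nondimensional downwash; negative is downwash (assumed convention)."""
    return 1.5 * (x * x - 0.5)


def local_induced_drag(x):
    """Local induced drag per unit span, up to a constant factor.

    Positive values are drag; negative values are induced thrust.
    """
    return -bell_load(x) * bell_downwash(x)


def integrate(f, a, b, n=20000):
    """Composite midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


x_cross = math.sqrt(0.5)  # station where downwash becomes upwash
inboard = integrate(local_induced_drag, 0.0, x_cross)
outboard = integrate(local_induced_drag, x_cross, 1.0)
net = inboard + outboard

print(f"inboard (drag)    = {inboard:+.5f}")
print(f"outboard (thrust) = {outboard:+.5f}")
print(f"net               = {net:+.5f}")
```

In this sketch the outboard contribution is negative (thrust) while the semispan total remains positive (net drag), which is the behavior the text describes.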

The downwash/upwash curve of the bell spanload is one smooth and continuous function from beyond 
one wingtip, across the wing to beyond the opposite wingtip. Note that the slope of the downwash/upwash 
function will also be continuous across the wing. The upwash curve rises from the equilibrium level of the 
air far beyond the wing tip to a gentle peak at the maximum upwash of the wing, located outboard of the 
wingtip, which we show as an extension of Prandtl’s downwash/upwash. Inboard of the peak, the upwash 
decreases and meets the upwash of the wing at the wingtip, and the two upwash curves, inboard and 
outboard, must be of equal slope. 

As with any other aircraft, to turn we deflect the control surfaces near the wingtips, increasing the lift 
near one wingtip, resulting in the desired bank angle. But when we increase the lift on one wingtip the 
resulting induced thrust also increases (there is always thrust at the wingtips). The raised wing will create 
more thrust than the lowered wing, resulting in both bank and yaw in the direction of the turn: proverse 
yaw. As a result of this proverse yaw, coordinated flight is achieved without the need for a tail, rudder, or 
other drag devices. 

Figure 3(a) shows the downwash field behind a wing using the bell spanload through use of twist 
(Marko Stamenovic). Figure 3(b) shows the vortex roll-up behind the wing, analytical and in flight 
(Marko Stamenovic and Tom Tschida, NASA photo). 
Figure 3. Downwash field, wing vortex roll-up, and resulting wingtip overlap in bird formations. 


The Experiment 


To validate the theory and the most critical principles of the bell spanload, we conducted an experiment 
using two subscale flying wing aircraft that used wing twist to achieve the selected bell shaped spanload. 
The model planform with a bell shaped spanload based on Prandtl’s theory was a 25-percent Horten H Xc 
aircraft (12.3 ft span) with a design lift coefficient of 0.6. The objective of the experiment was to 
demonstrate coordinated flight with proverse yaw for an aircraft with a bell shaped spanload and no vertical 
surfaces of any kind on the aircraft. The elevons have equal and opposite throws while functioning as 
ailerons. There is no differential bias; this is a direct, stick to surface control system. 


The Bell Spanload Aircraft Experiment 


The radio-controlled aircraft were bungee-launched and flown by a pilot on the ground. Bungee tension 
was roughly 50 lb at release, and typical altitude at separation from the bungee cord was 200 ft above ground 
level. The pilot flew the aircraft on a single racetrack pattern during its descent and landing on the dry 
lakebed from which it was launched, completing various flight dynamics maneuvers en route to collect data. 
Flight times increased as experience grew, reaching a maximum flight time of 1 min 55 s and averaging 
nearly 1 min 22 s per flight on aircraft no. 2. Nearly 3 hr of flight time has accumulated. 

The first aircraft carried an on-board data collection system: a smartphone with a triad linear 
accelerometer and triad angular rate recording application. This aircraft also later flew with a 
microcomputer-based flight data recorder providing basic inertial measurement unit functionality (pitch 
rate, roll rate, yaw rate, airspeed, and heading). The data sensors included global positioning system, 
pitot/static system, alpha/beta probes, and control position transducers. Configuration 3 had an open-source 
data recorder and autopilot, inertial measurement unit, global positioning system, pitot/static system, 
alpha/beta probes, and control position transducers. All data-gathering and -generating systems were 
calibrated before flight. Data were downloaded after each flight for later analysis. 


Mass Properties 


The aircraft’s mass properties are: roll inertia 5.425 slug-ft²; pitch inertia 0.2717 slug-ft² (estimated); 
yaw inertia 5.818 slug-ft²; and x-z plane cross product of inertia 0.5054 slug-ft². The inertias were measured 
using a bifilar method, except for pitch inertia, which was estimated from the computer-aided design 
geometry and the point mass locations of the onboard systems. The center of gravity was placed at 0.128 
of the mean aerodynamic chord. The aircraft mass was 14.5 lb. The lateral-directional mass properties 
proved to be critical to the experiment. Maine and Iliff (ref. 7) show a very high sensitivity to x-z plane 
cross product of inertia in the estimation of Cnda (yawing moment due to aileron deflection coefficient). 


Data Parameters 


We gathered flight mechanics data for the aircraft with instrumentation for the following parameters: 
angle of attack (-20 to 70 deg); angle of sideslip (-45 to 45 deg); total pressure (0 to 2.16 lb/ft²); static 
pressure (0 to 2.16 lb/ft²); normal acceleration (+/- 6 g); axial acceleration (+/- 4 g); lateral acceleration 
(+/- 4 g); roll rate (+/- 200 deg/sec); pitch rate (+/- 200 deg/sec); yaw rate (+/- 100 deg/sec); left elevon deflection 
(+/- 90 deg); and right elevon deflection (+/- 90 deg). The sampling rate was 20 samples per second for all 
parameters. Open-source microprocessor systems were used for all data collection. 


Preliminary Design 


We performed preliminary design analyses using two methods: a vortex-lattice model paneling the 
aircraft as 320 discrete surfaces (ref. 8), each of the discrete surfaces with its own angle; and a build-up of 
two-dimensional airfoil panel methods (7 span locations, with 5 control surface deflections, 5 chord 
Reynolds numbers varying from 200,000 to 2,000,000, and at 9 angles of attack from -2 deg to 10 deg). 
The build-up of the two-dimensional airfoils was integrated in MATLAB (The MathWorks, Inc., Natick, 
Massachusetts). The airfoils and twist are detailed in tables 1, 2, and 3. A converged solution was declared 
between the two result sets when we achieved a four-significant-digit match. The airfoils were 
custom-designed using the Eppler code (refs. 9, 10). Estimates of the control surface effectiveness were 
made from the vortex-lattice results and adjusted on the basis of boundary layer thickness. The control 
surface effectiveness was also adjusted on the basis of the control surface configuration change (plain 
surface to plain surface with balance added) for the scale of the aircraft. The model scale was set at 25 
percent; however, the mass of the vehicle increased due to the addition of the instrumentation package. The 
resulting wing loading placed the subscale aircraft in the range of the full-scale wing loading; consequently 
the subscale aircraft flew at velocities closely matching the full-scale aircraft predictions. The wingspan is 
12.3 ft with a leading-edge sweep of 24 deg at the nose. 

With this in mind for the analysis, the common use of Oswald’s efficiency factor (“e”) is also not 
appropriate for bell spanloads. Perhaps a Prandtl efficiency factor (“p”) or a bell efficiency 
factor (“b”) should be introduced for these Prandtl 1933 bell spanloads. A comparison of the elliptical and bell 
spanload efficiency parameters is given in table 4. 
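
The reason a same-span efficiency factor undersells the bell spanload can be illustrated with classical lifting-line arithmetic. The sketch below assumes the standard Fourier-series result that the span efficiency is e = 1 / sum(n (A_n / A_1)^2), that induced drag at fixed lift scales as 1 / (e b^2), and that "same structure" means equal lift and equal integrated bending moment (equal integral of load times y squared). All function names are hypothetical.

```python
import math


def fourier_sine_coeff(gamma, n, m=4000):
    """A_n = (2/pi) * integral over [0, pi] of gamma(t) sin(n t) dt (midpoint rule)."""
    h = math.pi / m
    total = sum(gamma((i + 0.5) * h) * math.sin(n * (i + 0.5) * h) for i in range(m))
    return (2.0 / math.pi) * h * total


def span_efficiency(gamma, n_max=9):
    """Span efficiency e = 1 / sum_n n (A_n / A_1)^2 (classical lifting line)."""
    a = [fourier_sine_coeff(gamma, n) for n in range(1, n_max + 1)]
    return 1.0 / sum(n * (a[n - 1] / a[0]) ** 2 for n in range(1, n_max + 1))


def moment(load, power, m=4000):
    """Integral over [0, 1] of load(x) * x**power dx (midpoint rule)."""
    h = 1.0 / m
    return h * sum(load((i + 0.5) * h) * ((i + 0.5) * h) ** power for i in range(m))


def elliptical(x):
    return math.sqrt(1.0 - x * x)


def bell(x):
    return (1.0 - x * x) ** 1.5


# With x = cos(t), the bell load (1 - x^2)^(3/2) becomes sin(t)^3.
e_bell_same_span = span_efficiency(lambda t: math.sin(t) ** 3)

# Equal lift and equal integrated bending moment fix the ratio
# (integral of load * y^2) / (integral of load), which scales with semispan^2.
k_elliptical = moment(elliptical, 2) / moment(elliptical, 0)
k_bell = moment(bell, 2) / moment(bell, 0)
span_ratio = math.sqrt(k_elliptical / k_bell)

# Induced drag of the stretched bell wing relative to the elliptical wing.
drag_ratio = 1.0 / (e_bell_same_span * span_ratio ** 2)

print(f"e (bell, same span)      = {e_bell_same_span:.4f}")
print(f"span ratio (bell / ell.) = {span_ratio:.4f}")
print(f"induced drag ratio       = {drag_ratio:.4f}")
```

Under these assumptions the sketch gives e of about 0.75 at equal span, a span ratio of about 1.22, and an induced drag ratio of about 0.889, consistent with the 22 percent greater span and 11 percent lower drag quoted in the Introduction.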

The coordinate frame for the flight mechanics data on the aircraft is: origin at the center of gravity 
(12.875 inches aft of the nose); x-axis is positive forward out the nose; y-axis is positive out the right wing; 
and z-axis is positive down out the bottom of the aircraft. Using a right-hand convention, roll is rotation 
about the x-axis and is positive for roll right; pitch is rotation about the y-axis and is positive for pitch up; 
and yaw is rotation about the z-axis and is positive for yaw right. 

For the wing definition, the coordinate frame is: the x-axis origin is at the wing centerline and extends 
to b/2 (half-span). The y-axis is defined as vertical upward (though in specific cases it is defined otherwise, 
and the convention should be apparent from the context). 

The aircraft design was generated to produce a bell spanload. The airfoils vary continuously and linearly 
from the centerline to the tip. The airfoils are specified in nondimensional coordinates. The airfoils used 
are shown in tables 1 and 2. 


Table 1. Airfoil section, centerline. 


0.37059 | 0.10000 | 0.00000 | 0.00000 | 0.37059 | -0.01904 





Table 2. Airfoil section, wingtip. 


0.43132 | 0.04453 | 0.00000 | 0.00000 | 0.43132 | -0.04453 


The wing twist is nonlinear; it is specified at 20 intervals from the centerline to the wingtip, in degrees, 
as shown in table 3. Using the above airfoil coordinates, this twist does not require any compensation for 
aerodynamic twist relative to the geometric twist. 


Table 3. Wing twist distribution. 


Span station | Twist, deg | Span station | Twist, deg 
0 | 8.3274 | 11 | 7.2592 
6 | 8.8257 | 17 | 1.9394 
8 | 8.4565 | 19 | -0.6417 
9 | 8.1492 | 20 | -1.6726 
10 | 7.7522 | | 


The control surfaces are located in the outboard 14 percent of each wing, in the trailing 25 percent of 
the chord; the round tips are included as part of the control surfaces. The wingspan is 12.3 ft, the wing area 
is 10.125 ft², the centerline chord is 15.75 in., and the wingtip chord is 3.94 in. The wing had 2.5 deg of 
dihedral. 
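
The quoted geometry can be cross-checked with a few lines of arithmetic. This sketch uses the span, area, chords, weight, and design lift coefficient given in the text; the straight-taper (trapezoid) area check and the standard sea-level air density are assumptions made here for illustration and are not values stated by the authors.

```python
import math

# Values quoted in the text.
SPAN_FT = 12.3
WING_AREA_FT2 = 10.125
ROOT_CHORD_IN = 15.75
TIP_CHORD_IN = 3.94
WEIGHT_LB = 14.5
DESIGN_CL = 0.6

# Assumption: standard sea-level density, slug/ft^3.
RHO_SEA_LEVEL = 0.002377

aspect_ratio = SPAN_FT ** 2 / WING_AREA_FT2
taper_ratio = TIP_CHORD_IN / ROOT_CHORD_IN
mean_chord_ft = WING_AREA_FT2 / SPAN_FT

# Assumption: straight taper from centerline to tip (ignores the rounded tips).
trapezoid_area_ft2 = SPAN_FT * (ROOT_CHORD_IN + TIP_CHORD_IN) / 2.0 / 12.0

wing_loading_psf = WEIGHT_LB / WING_AREA_FT2

# Level-flight speed at the design lift coefficient: W/S = 0.5 rho V^2 CL.
design_speed_fps = math.sqrt(2.0 * wing_loading_psf / (RHO_SEA_LEVEL * DESIGN_CL))

print(f"aspect ratio        = {aspect_ratio:.2f}")
print(f"taper ratio         = {taper_ratio:.3f}")
print(f"mean chord          = {mean_chord_ft:.3f} ft")
print(f"trapezoid area      = {trapezoid_area_ft2:.2f} ft^2")
print(f"wing loading        = {wing_loading_psf:.3f} lb/ft^2")
print(f"speed at CL = 0.6   = {design_speed_fps:.1f} ft/s")
```

The straight-taper area comes out within about half a percent of the quoted 10.125 ft², which suggests the quoted chords and area are mutually consistent under that assumption.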





Table 4. Elliptical spanload and bell spanload comparison of spanload parameters and efficiency factors. 


pe 000 8889 


This comparison uses the traditional elliptical spanload as the baseline against which the bell 
spanload is compared. 

In figure 4, three spanloads (blue = -5 deg; red = 0 deg; and green = +5 deg) are plotted showing the 
effect of sideslip on the area of induced thrust near the wingtips and the resulting effect on yawing moment. 
The light-green line shows a large area of induced thrust on the left and a small area of induced thrust on 
the right, which would result in a large right-yawing moment. 
Figure 4. Effect of sideslip on bell spanload with twist (0 and +/- 5 deg). 


Aerodynamic Coefficient Estimation 


We used Maine and Iliff’s output-error approach to estimate the aerodynamic coefficients, which was 
the same source used for the estimates of Cnda. We assume the aircraft is a continuous-time dynamic 
system. The process of estimating the aerodynamic coefficients is an exercise in system identification. 
Assumptions were made in this process, many based on previous experience, such as which parameters 
were important (these are retained) and which were not (these are ignored). This approach uses a 
formulation of the solid-body aircraft flight mechanics as a linear simulation of the vehicle. An initial 
estimate is made of the aerodynamic coefficients; the simulation then makes an estimate of the vehicle 
motion based on the aerodynamic coefficients, the mass properties, and the equations of motion, after which 
the linear estimates are compared to the measured flight data from the vehicle. Errors from all of these 
measurements subject the final estimates of the aerodynamic coefficients to uncertainty. The errors between 
the simulation output and the measured data are quantified by a weighted, 
error-based cost function defined by the researchers. 

The aerodynamic coefficient estimates are then varied, and slopes or gradients are determined 
numerically from the errors. The estimation program then marches toward minimizing the cost function 
from the “fit” between the output of the linear simulation and the measured flight data. The maneuvers were 
simple doublets: a square-wave pulse followed immediately by 
a similar pulse of the opposite sign. 
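
The estimation loop described above (simulate, compare with measurement, evaluate a cost, perturb the coefficients, step toward lower cost) can be sketched on a toy problem. This is only an illustration of the output-error idea, not the Maine and Iliff program or the model used in the experiment: the one-state model r' = a r + b delta, the parameter values, the noise level, and the Gauss-Newton step with step halving are all assumptions made here. Only the 20 samples-per-second rate and the doublet input shape come from the text.

```python
import random

DT = 0.05  # 20 samples per second, as stated in the text


def doublet(t, amplitude=1.0, start=0.5, width=0.5):
    """Square-wave pulse followed immediately by an opposite-sign pulse."""
    if start <= t < start + width:
        return amplitude
    if start + width <= t < start + 2.0 * width:
        return -amplitude
    return 0.0


def simulate(params, n_steps=80):
    """Euler simulation of the hypothetical model r' = a*r + b*delta."""
    a, b = params
    r, out = 0.0, []
    for k in range(n_steps):
        out.append(r)
        r += DT * (a * r + b * doublet(k * DT))
    return out


def cost(params, measured, weight=1.0):
    """Weighted sum of squared output errors."""
    return 0.5 * weight * sum((m - s) * (m - s)
                              for m, s in zip(measured, simulate(params)))


def gauss_newton_direction(params, measured, eps=1e-6):
    """Step from finite-difference sensitivities: (J^T J) dp = J^T residual."""
    base = simulate(params)
    resid = [m - s for m, s in zip(measured, base)]
    cols = []
    for j in range(2):
        p = list(params)
        p[j] += eps
        cols.append([(x - y) / eps for x, y in zip(simulate(p), base)])
    g = [sum(c * r for c, r in zip(col, resid)) for col in cols]
    h = [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    return [(g[0] * h[1][1] - g[1] * h[0][1]) / det,
            (h[0][0] * g[1] - h[1][0] * g[0]) / det]


def estimate_parameters(measured, initial, iterations=20):
    """March toward lower cost, halving the step until the cost decreases."""
    params = list(initial)
    for _ in range(iterations):
        step = gauss_newton_direction(params, measured)
        scale = 1.0
        while scale > 1e-4:
            trial = [p + scale * d for p, d in zip(params, step)]
            if cost(trial, measured) < cost(params, measured):
                params = trial
                break
            scale *= 0.5
    return params


random.seed(0)
true_params = [-2.0, 0.5]  # hypothetical "truth" for the synthetic data
measured = [s + random.gauss(0.0, 0.002) for s in simulate(true_params)]
initial = [-1.5, 0.4]
estimate = estimate_parameters(measured, initial)
print("estimate:", estimate)
```

The synthetic "measurement" is generated from the same toy model plus noise, so the sketch only shows that the procedure recovers parameters it was given; real flight data would add modeling error on top of measurement noise.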

The flight research on the small flying wing glider was successful, as can be seen in 
figure 5(a). We measured proverse yaw in flight for the first time on June 27, 2013. A sample output from 
the flights showing proverse yaw is given in figure 5(b). 
Figure 5. A subscale aircraft in flight, and resulting proverse yaw data trace. 


Figure 5(b) shows a data trace of the angular rates from onboard instrumentation. Red is pitch rate, blue 
is roll rate, and green is yaw rate. The high-frequency motion in pitch rate is due to air turbulence. The yaw 
motion following the roll motion is the same sign; the yaw gain is 0.0643 and correlation is 0.77 for this 
maneuver. All rates are to the same scale. 

We calculated the coefficient Cnda from the flight research maneuvers. In figure 6, the scattered dots 
represent the flight research maneuvers. The value of Cnda is positive and the trend of the slope is also 
positive. The degree of scatter in the data is a result of all experimental error. From this we see that Cnda 
is providing the yawing moment in the same direction as the rolling moment. 
Figure 6. Yawing moment due to aileron deflection coefficient versus lift coefficient. 


The yawing moment due to aileron deflection coefficient (Cnda) is shown plotted against lift coefficient 
in figure 6. The blue line and red circles were predicted for the full-scale aircraft. The black line is an 
estimate of the 0.25-scale experiment aircraft. The black dots are the estimates from flight data. Error bars 
are 5× Cramér-Rao bounds. The straight line with the circles represents the analytical data from the vortex 
lattice. An estimate of the effect of scale is made on the vortex-lattice results (reducing the scale reduces the 
effectiveness of the control deflection and reduces the resulting yawing moment). The good comparison 
between the predicted and the measured flight Cnda confirmed our expectations regarding Prandtl’s bell 
spanload. 


Birds and the Bell Spanload 


There are at least two larger implications of this work. The first is for the avian research community; 
the second is for the aeronautical world. 

That birds have no vertical tail yet effect effortless turns remains a puzzle, inasmuch as all avian flight 
research is analyzed using the elliptical spanload. The matter of formation flight also defies satisfactory 
explanation despite a century’s worth of research, analysis and effort, again in large part because the 
analysis relies entirely on the elliptical spanload. (“The wake [of the kestrel] was found to be similar to that 
measured behind an elliptically loaded airfoil of the same span,” wrote Geoff Spedding when analyzing his 
data. “As a result, classical airfoil theory for an elliptically loaded wing was used to calculate parameters 
such as lift coefficients and efficiency factors” (ref. 11).) Less apparent but equally puzzling to close 
observers is the shape of birds’ wings when compared to aircraft wings: the former taper, often to a sharp 
point, while the latter rarely do, and this, too, defies the elliptical spanload solution. The load distribution 
over a bird’s wing is far more gradual than an elliptical spanload provides: consider a bird’s wing 
structure (both skeletal and on the surface), which tapers to almost nothing near the tips, where the 
outermost feathers carry virtually no load at all, as compared to an aircraft’s wing. An elliptically loaded 
aircraft’s wings carry loads right to the wingtip. 

First, based on our research results we assert that the growing data on bird flight is irreconcilable so 
long as it relies on the elliptical spanload as the analytical tool. Second, based on the analytical results of 
the bell spanload and the flight data, we assert the only viable solution for interpreting bird flight, formation 
flight, and bird wing structure is the bell spanload. 

We know that it is a biological imperative that birds carry no excess structure in their wings or chest 
muscles, only as much muscle, tendon, and bone as necessary. Birds embody minimum structure while 
achieving maximum aerodynamic efficiency and accomplishing coordinated flight: birds are a solution 
to a multivariate optimization. Recall that Prandtl’s second paper provided a spanload solution to maximum 
efficiency for a given structural weight when the wingspan need not be constrained. The bell spanload is 
the only explanation for how birds achieve this multivariate solution (refs. 12-15). 


Birds-Bell Spanload; Airplanes-Elliptical Spanload 


1. Birds’ primary feathers are soft and flexible at their wingtips and the wings have a narrow chord; 
these wingtip feathers are incapable of supporting any substantial load. Additionally, the outboard wing 
structures of birds are long and slender. The ligaments, tendons, supporting muscles, and bones are long 
and thin, improving aerodynamic performance, but the load-carrying ability of these structures is very 
modest (the same was true for pterosaurs). In contrast, aircraft wingtip structures are large, heavy, and 
expected to carry real loads in flight. 

2. Birds flying in formation position themselves to capture upwash from a leading bird’s wing vortex 
roll-up for added efficiency. Data show they do this with wings overlapped. Aircraft flying in formation 
with similar objectives do not match this profile, however: they fly with wingtips in line. 

3. Birds do not experience wingtip stall even with their narrow-chord, sharp-tipped wings. But when 
sharp-tipped swept wings are used on aircraft, wingtip stall is common and requires other solutions to 
overcome. 

We are accustomed to seeing birds turn and maneuver without a vertical tail, and only seeing aircraft 
do so using such drag-inducing devices. The ability to turn and maneuver without resorting to drag-inducing 
devices to counter adverse yawing forces is the first evidence for why the bell spanload—which generates 
proverse yaw—explains the flight of birds. 

Figure 7 shows a wandering albatross (Diomedea exulans) in flight. The wandering albatross has no 
vertical tail, yet these birds are able to expertly fly so that they precisely touch their wingtips to the water. 





Photo: Phil Barnes 
PelicanAG.com 
Figure 7. Wandering albatross in flight. 


Researchers such as Wieselsberger (ref. 16), Lissaman and Schollenberger (ref. 17), and Portugal have 
argued that flying in formation allows birds to capture upwash in the air from the wing vortex roll-up. There 
is no dispute that birds maximize the energy from the upwash, something only possible in formation flight. 
What we dispute is where that vortex occurs on birds. Figure 8(a) shows a formation of pelicans flying with 
wingtips overlapped, which is an optimal arrangement with the bell spanload but suboptimal for the 
elliptical spanload because in this case the vortex roll-up is not at the wingtip but inboard of the wingtip 
(at 0.704 of the semi-span) and is in fact a wing vortex roll-up, not a wingtip vortex roll-up. Spedding’s data 
support this, as can be seen in figure 8(b); the vortices seen behind his kestrel show a vortex roll-up inboard 
of the wingtips. (“This wing loading distribution [elliptical] is reflected in the geometry of the wake,” he 
wrote). Birds position themselves in formation flight based on the location of the actual vortex roll-up, and 
only the bell spanload generates a vortex roll-up in that location. 


Figure 8. Significant bird flight characteristics; a: Formation flight of brown pelicans (Pelecanus 
occidentalis) demonstrating the resulting wingtip overlap; and b: Spedding’s kestrel (Falco tinnunculus) 
data showing an inboard vortex core location. 


Figure 9(a) shows spanwise location data of the following bird relative to the lead bird in the northern ibis 
(Geronticus eremita) from Portugal, with Hainsworth (ref. 18), Cutts & Speakman (ref. 19), and Speakman 
& Banks (ref. 20). Figure 9(b) shows an overlay of the data sets with our addition of the downwash curve 
of the Prandtl 1933 spanload and our extension of Prandtl’s 1933 theorem. 
Figure 9. Bird position in formation flight. 


Portugal recently published research based on global positioning system data showing northern ibis 
flying in formation with the tips of their wings overlapping. He concluded that the mean spacing was 
0.904 m on a mean wingspan of 1.2 m, for a vortex core separation of 0.753 of span. Spedding gave the 
vortex core separation of his kestrel research as 0.76 of span. The vortex separation on our research flying 
wing aircraft occurred at 0.704 of the semispan. Portugal, like Spedding and others before him, analyzed his 
results using the elliptical spanload, which forces the analysis to assume that the birds fly in formation with 
their wingtips in line with each other rather than with wings overlapped. Birds in formation flight seek out 
the greatest upwash, and there is a clear, strong correlation between the location data of birds in formation 
flight and the vortex formation and upwash data of the bell spanload. 
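The spacing figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are ours, and it assumes that the reported spacing is the lateral distance between the centerlines of adjacent birds, so that any spacing smaller than the wingspan implies overlapped wingtips.

```python
def spacing_fraction(spacing_m: float, wingspan_m: float) -> float:
    """Lateral spacing between adjacent birds as a fraction of wingspan."""
    if wingspan_m <= 0:
        raise ValueError("wingspan must be positive")
    return spacing_m / wingspan_m


def wingtip_overlap(spacing_m: float, wingspan_m: float) -> float:
    """Overlap of adjacent wingtips in meters (negative means a gap).

    Assumes spacing is measured centerline to centerline.
    """
    return wingspan_m - spacing_m


# Values quoted in the text for the northern ibis data.
fraction = spacing_fraction(0.904, 1.2)
overlap = wingtip_overlap(0.904, 1.2)
print(round(fraction, 3))  # 0.753
print(round(overlap, 3))   # 0.296
```

Under that assumption the birds overlap by roughly 0.3 m, which is consistent with the overlapped formation described in the text rather than a tip-to-tip arrangement.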

How are birds able to fly with pointed wingtips? Note how the lift tapers gradually to zero at the wingtip 
with the bell spanload. The result is that even wings with very strongly tapered tips show no tendency to 
wingtip stall. Rather than occurring at the wingtip (as it will with an elliptical spanload) the stall begins 
about 20 percent out from the wing root, something observable in the flight of birds [figure 10(b)]. Because 
the bell spanload creates proverse yaw in the outer third of the wing, the thrust yields controllability even 
with a sharply tapered wing. 

The upwash at the tips of the bell spanload makes it possible to capture the wingtip-induced thrust that 
can then generate coordinated roll and yaw without resorting to the use of a vertical tail and without 
generating drag at the wingtips. If we accept Prandtl’s 1933 lift distribution as useful for birds, it follows 
that birds are manipulating thrust at their wingtips to control yaw. 
Figure 10. The local lift coefficient (Cl) and the beginning of stall on the wing of a wandering albatross. 


Figure 10(a) shows the local lift coefficient as a function of span for a bell spanload from centerline to 
the wingtip. Note that the highest point on the curve is the area in which the wing would first stall. Figure 
10(b) shows an image of a wandering albatross soaring at low speeds (image by Jeff Jennings). Ruffled 
feathers indicate the beginnings of the stall, at approximately 20 percent of span, not near the tip, matching 
the bell spanload predictions. 

Combining observational evidence and data developed by avian researchers with our own research 
results, we assert that only the bell spanload provides a coherent paradigm for bird flight. Our research 
offers for the first time a theory and a tool derived from flight test that satisfactorily explains bird flight to 
match the data. It also serves as a solution to far more efficient aircraft flight. 


Conclusion 


The bell spanload maximizes aerodynamic efficiency with a given structure, coordinates the roll-yaw 
motion so that birds are able to turn and maneuver without a vertical tail, and explains why birds fly in 
formations with their wingtips overlapped, as well as how birds use narrow wingtips without experiencing 
tip stall. 

The bell spanload also allows for improved aircraft designs, particularly all flying-wing aircraft and 
blended-wing body aircraft. Even conventional tailed aircraft can benefit from the improved aerodynamics 
and minimum structure approach. There are circumstances in which span constraints exist (such as 
extremely large transport category aircraft), in which cases current approaches provide better solutions. 

Neither Prandtl nor Horten followed through to the logical and complete conclusion of their work. 
Prandtl did not extend the upwash outboard of the wingtip, which would have answered the question of 
formation flight in birds, and he did not find the induced thrust at the outboard ends of the wings, which 
leads to proverse yaw. In turn, with his approximation and objectives Horten did not understand the origin 
of the induced thrust at the outboard ends of the wings for proverse yaw, and he did not prove that proverse 
yaw exists. 

It remained for the current authors to prove conclusively that proverse yaw is achievable through an 
efficient bell-shaped spanload, that an optimal solution integrating minimum structure and minimum drag 
can solve the problem of yaw control and stability of a flying wing, and that the bell spanload solution 
answers some of the great enduring mysteries of the flight of birds. 

In the case of the flight of birds, the bell spanload is the only viable solution. 
The Mechanical Analog Computers of 
Hannibal Ford and William Newell 


A. BEN CLYMER 


The history of mechanical analog computers is described from early developments 
to their peak in World War II and to their obsolescence in the 1950s. 
The chief importance of most of these computers was their contribution to the 
superb gunnery of the US Navy. The work of Hannibal Ford, William Newell, 
and the Ford Instrument Co. is the framework around which this account is 
based. 


For over 40 years mechanical analog computers provided 
the US Navy with the world’s most advanced and capable 
fire-control systems for aiming large naval guns and 
setting fuze times on the shells for destroying either surface 
or air targets. A large part of this preeminence can be 
attributed to the work of Hannibal Ford and William Newell. 
However, the credit has usually been withheld, first 
because of security classifications and later by the resulting 
widespread ignorance of even the main facts of their stories. 

The history of the evolution of fire-control equipment 
can be divided into three crudely defined periods of progress: 
early, middle, and late, being respectively the eighteenth, 
nineteenth, and twentieth centuries. In the early 
period, the eighteenth century, there was no perception of 
fire control as a hierarchical system, so there were no inventions 
on the system level. Lack of concern for improvement 
caused continuation of the status quo. In the middle period, 
the nineteenth century, there began a trend toward automation 
in many practical pursuits (e.g., the cotton gin, railroads, 
steamboats, and glass-forming machines) which extended to 
naval gunnery. Handwheels provided a mechanical advantage 
in training and elevating guns. The man-machine system 
was being made easier and better for the men by 
delegating more to machines. 

In the late period, the twentieth century, people have 
seen the system as a whole, and they have been conscious 
of missing subsystems. Inventions then took place on the top 
echelon, and system engineering began to deal with the 
entire hierarchical system. In the late period there was 
concern for errors of system performance. In the case of a 
fire-control system, the contributions of all causes to the 
ultimate miss data were studied to identify the most critical 
remaining sources of error. 


Early analog computing mechanisms 


To understand the types of mechanisms invented by 
Ford and Newell, it is necessary to briefly examine a few of 
the simple components from which they arose. The history 
of mechanical analog devices goes back at least to Vitruvius 
(50 BC), who described the use of a wheel for measuring arc 
length along a curve, the most simple integral in space. Many 
other elementary analog devices were described before the 
modern period: Differential gears (Figure 1), used for adding 
or subtracting two variables, are usually ascribed to 
Leonardo da Vinci; and Leibniz is credited for the idea late 
in the seventeenth century of a similar-triangles device for 
equation solving or root solving. 

The first device to form the integral under a curve, or the 
area within a closed curve, was the integrator of B.H. Hermann 
in 1814. Hermann’s integrator was essentially a wheel 
pressed against a disk, as shown in Figure 2. There was a 
second disk over the first, which squeezed the wheel between 
them. The rate of rotation of the wheel is proportional 
to the product of the disk rotation rate and the radial 
location of the point of contact of the wheel on the disk. That 
is, the rate of change of angular position of the wheel z is 
given by 

$$\frac{dz}{dt} = K\,y\,\frac{dx}{dt}$$

where z is the time integral of y times a constant, x is the 
angular position of the disk, y is the radial location of the 
point of contact, and K is a scale constant. Note 
that the variables in this device are angular and linear 
positions. 
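The behavior of the wheel-and-disk integrator can be imitated numerically. The following is a minimal sketch, not a model of any particular instrument: the function name, the midpoint-rule stepping, and the test function y = x are our own assumptions, chosen only to show that accumulating K·y·dx as the disk turns reproduces the integral of y with respect to x.

```python
def disk_integrator(y_of_x, x_start, x_end, K=1.0, steps=10_000):
    """Accumulate wheel rotation z from dz = K * y * dx.

    y_of_x  -- radial position of the wheel as a function of disk angle x
    K       -- scale constant of the mechanism (hypothetical value)
    steps   -- number of small disk rotations used in the sum
    """
    dx = (x_end - x_start) / steps
    z = 0.0
    for i in range(steps):
        x_mid = x_start + (i + 0.5) * dx   # midpoint of this small rotation
        z += K * y_of_x(x_mid) * dx        # wheel turns in proportion to y
    return z


# With y = x, the wheel rotation after the disk turns from 0 to 2
# should equal the integral of x dx, which is 2.
z = disk_integrator(lambda x: x, 0.0, 2.0)
print(round(z, 6))  # 2.0
```

Note that the independent variable here is the disk angle x, not time; the mechanism integrates with respect to whatever quantity drives the disk.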

An early application of such integrators was the integration 
of force over distance to measure work. Another application 
was a planimeter to measure the area within a closed 
curve. In fact, the chief impetus behind the early integrator 
inventions of the nineteenth century was to get an improved 
planimeter. 

James Clerk Maxwell described a ball type of integrating 
device while he was an undergraduate; it was incorporated 
in a planimeter design. In about 1863, James Thomson 
conceived an equivalent integrator in which a ball 
rotates between the disk and a cylinder (see Figure 3). The 
angular position of the cylinder is the output variable z, and 
the ball replaces the wheel of the Hermann integrator. The 
ball is held in a housing that is translated along the radius of 
the disk with displacement y. This integrator became the 
heart of numerous harmonic analyzers and time analyzers. 
Figure 2. The Hermann integrator. 


In 1881 a different type of integrator was developed in 
Madrid by V. Ventosa. It consisted of a tiltable drive roller, 
a ball, and four output rollers. If wind velocity is put into the 
drive roller (marked “A” in Figure 4) as angular velocity, 
and if wind direction is put in as tilt angle, then the four 
output rollers turn with speeds proportional to the compass 
components of wind velocity. As a computing device this 
ball constitutes a “component integrator”: it produces the 
time integral of the sine and cosine components of a given 
varying magnitude. Later forms of trigonometric integrators 
were developed by Hele-Shaw, Smith, Newell (see Appendix), 
and others. 
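The function of a component integrator can be sketched in software. The example below is an illustration under stated assumptions, not a description of Ventosa’s hardware: the function name is ours, direction is assumed to be measured clockwise from north in degrees, and a simple rectangular sum stands in for the continuous rolling of the ball.

```python
import math


def component_integrator(samples, dt):
    """Integrate the north and east components of a varying magnitude.

    samples -- iterable of (speed, direction_deg) pairs, one per time step
    dt      -- duration of each time step
    Returns (north, east), the time integrals of speed*cos and speed*sin.
    """
    north = 0.0
    east = 0.0
    for speed, direction_deg in samples:
        theta = math.radians(direction_deg)
        north += speed * math.cos(theta) * dt   # cosine component
        east += speed * math.sin(theta) * dt    # sine component
    return north, east


# A steady magnitude of 10 units directed due east for three unit steps
# accumulates entirely in the east component.
north, east = component_integrator([(10.0, 90.0)] * 3, dt=1.0)
print(round(north, 6), round(east, 6))
```

The two accumulators play the role of the output rollers: each turns at a rate set by one trigonometric component of the input.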

Harmonic analyzers were developed to determine the 
coefficients of a Fourier series to fit a given record, such as 
tide data. Lord Kelvin built two, the second in 1879. A 
refined version by Michelson and Stratton built in 1897 
could sum 80 Fourier terms. According to Vannevar Bush, 
a three-dimensional cam for multiplying was developed by 
Bollée. 





A two-dimensional cam (Figure 5) was used to generate a virtually 
arbitrary function of one variable: The input is the rotation angle of 
the cam, and the output is the radius of the cam at the point of contact of 
a roller. A three-dimensional cam (Figure 6) was similarly used to 
generate a function of two variables, such as time of flight as a 
function of range angle and elevation angle to the target. 
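In software terms a two-dimensional cam is a stored function of one variable, read out by angle. The sketch below makes that analogy concrete. It is only an analogy: the profile values are hypothetical, and straight-line interpolation between tabulated points stands in for the smooth machined surface of a real cam.

```python
from bisect import bisect_right


def cam_radius(angle_deg, profile):
    """Return the cam radius at a given rotation angle.

    profile -- list of (angle_deg, radius) pairs sorted by angle;
               values between entries are linearly interpolated.
    """
    angles = [a for a, _ in profile]
    if not angles[0] <= angle_deg <= angles[-1]:
        raise ValueError("angle outside the cut range of the cam")
    i = bisect_right(angles, angle_deg) - 1
    if i == len(profile) - 1:          # exactly at the last tabulated angle
        return profile[-1][1]
    a0, r0 = profile[i]
    a1, r1 = profile[i + 1]
    return r0 + (r1 - r0) * (angle_deg - a0) / (a1 - a0)


# Hypothetical cam profile: radius as a function of rotation angle.
profile = [(0.0, 1.0), (90.0, 2.0), (180.0, 1.5)]
print(cam_radius(45.0, profile))   # 1.5
print(cam_radius(135.0, profile))  # 1.75
```

A three-dimensional cam extends the same idea to a table indexed by two inputs, one supplied by rotation and the other by translation of the follower along the cam.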

William Thomson, Lord Kelvin, had the powerful idea of using analog 
computing mechanisms tied together to solve a differential equation. 
Ten years later, Abdank-Abakanowicz built an “integraph,” 
which had the purpose of solving one particular differential equation. 
Thomson’s idea was the conception of differential analyzers, 
which, however, did not become a practical reality until the 1930s with 
the work of V. Bush. Lord Kelvin also invented a pulley device for 
solving simultaneous equations. Larger versions were built by MIT 
professor John Wilbur in 1934 and 1935. An “isograph” was developed 
at Bell Telephone Laboratories, following a concept due to 
Thornton Fry in 1937. It could find the roots of polynomials of up to 
10th degree, even if the roots were complex numbers. It was based on 
a Scotch yoke mechanism to transform from polar to rectilinear coordinates. 
The state of the art of these and other computing mechanisms 
has been summarized as of the end of World War II 
by Macon Fry and Clymer. 

Figure 1. The Ford 3/8-inch spur differential gears. (Photograph by Laurie Minor, Smithsonian Institution.) 

These analog mechanisms, together with a “multiplier” 
(using slides and based on the mathematics of similar triangles) 
and a “resolver” (which produced R sin θ and R cos θ 
from R and θ by means of a Scotch yoke mechanism), were 
among the building blocks for the practical computing systems 
to be described. 


Naval surface fire-control computers of 
1910 to 1930 


It is necessary to describe a little of the technology of 
naval gunnery and fire control to present a snapshot of the 
state of affairs just before the entry of Hannibal Ford into 
the picture. What he accomplished was in direct response to 
the needs of the US Navy. He was responsible for the 
development of mechanical analog computers of unprecedented 
size, complexity, dependability, ruggedness, and accuracy. 
The mechanical analog computers of 1915 were, however, 
quite simple, small, and uncomplicated compared with their 
descendants in the next three decades. 


The fire-control problem. In the nineteenth century the fire-control 
problem greatly increased in difficulty. Ranges had been 20 to 30 
yards in 1800. Most of the engagement between the Monitor and the 
Merrimac had been fought at 100 yards, which was virtually point-blank 
range, and the ships were slow in maneuvers, affording gunners 
plenty of time to take aim. By the end of the century, naval guns 
could fire at ranges far in excess of 10,000 yards. Ships could move 
much faster, and still rolled and pitched to large angles in heavy 
seas, causing both sights and guns to move off target. 

Figure 3. The Thomson integrator. (The displacement is perpendicular to the paper away 
from the disk center.) 

With the increased ranges available to guns the problem of “spotting” 
the errors in the locations of splashes of shells became more difficult 
even in the clearest weather. Likewise, the task of determining 
target range became more challenging. With the increased target 
range went a more than linear increase in the time of flight of a shell, 
so the target had more time in which to maneuver. Moreover, the 
greater time spent by a shell in flight enabled wind to have very important 
effects upon the impact point. Another complication was that rifling 
the gun barrels, while reducing random scatter, caused a systematic 
lateral “drift” of the projectile, which had to be compensated for in 
aiming the guns. 

Figure 4. The Ventosa integrator. 

The greater need for angular accuracy at greater ranges increased 
the importance of some relatively 
minor effects, such as variations in atmospheric temperature 
and pressure, barrel erosion resulting from previous firing 
(which reduced the initial velocity and hence the range of 
the shell), propellant weight and temperature variations, 
projectile weight, and so on. The largest disturbances to 
accurate naval gunnery were the rates of change of range 
and target bearing due to relative motions of “own ship” 
(the firing ship) and the target. 

Clearly the crisis in naval gunnery created pressure to 
improve naval fire-control equipment. 


Fire-control equipment of 1910 to 1915. During World 
War I fire-control equipment included three classes of devices. 


Devices aloft. Spotters’ scopes were used for viewing 
splashes in order to phone gun angle corrections (“spots”) 
relative to the line of sight. Optical range finders of successively 
improved types determined range to the target. 
(American models had a base of 14 to 20 feet, but the British 
had only 9 feet, giving double the error. German range 
finders were the best because they had the best optics and 
thus the best view.) 

Figure 5. A two-dimensional cam. (Photograph by Laurie 
Minor, Smithsonian Institution.) 

Directors, after about 1912, consisted of sights kept 
aimed at the target in train and elevation in order to correct 
gun train and elevation angles for own ship roll and pitch. 
The English company Vickers had the lead in director development. 
The US Navy purchased some of these directors 
from Vickers for 4-inch guns. 


Devices below decks (in the “plotting room” or “control 
information center”). Gyrocompasses determined own ship 
course (purchased from the Sperry Corporation by the US 
Navy after 1910). Plotting boards were used for plotting the 
paths of own ship and target to determine range at the future 
time when the projectile would arrive (“advance range”), 
using range-finder data. The invention of the plotting board 
is ascribed to a junior gunnery officer in about 1906. 

Range clocks let operators set in the present rate of 
change of range to obtain a crude running estimate of range. 
“Time of flight clocks” told the time when a shell fired 
“now” would reach the target. The Argo clock was a mechanical 
analog computer for solving the relative motion 
equations for range. As of 1912, the US Navy had a “fire-control 
table” (a mechanical analog computer) having input 
from the range finder and director. 

The pitometer log measured own ship speed. 








Figure 6. A three-dimensional cam. (Photograph by Laurie 
Minor, Smithsonian Institution.) 


Devices at the guns. Mechanical drives for guns appeared 
between 1907 and 1910. Manual tracking of command angles 
on dials positioned guns in train and elevation. Graduated 
sights on the guns had been used at the time of the 
American Civil War but were obsolete by 1910 or 1915. 


Differences between Britain and the US. The connectivity 
of the primitive fire-control “system” composed of the 
foregoing fragments foreshadowed some aspects of modern 
fire control. However, there were differences among the 
systems used by different countries. For example, between 
Britain and the US, there were differences in who controlled 
gunfire, from where, and with what use of the plotting 
room. In the US Navy, the plotting room personnel controlled 
the fire, using data from spotters and their own data 
to compute gun angles. On the other hand, the British 
preferred optical system angular outputs. Director personnel 
controlled the fire, using the plotting room information 
mainly to correct range. 

Thus the stage was set for the contributions of Hannibal 
Ford. 


The fire-control computers of Hannibal 
C. Ford 


Hannibal Choate Ford was born in Dryden, N.Y., on May 
8, 1877. His parents were Abram Millard Ford (born February 
22, 1831) and Susan Agusta Giles Ford (born June 3, 
1834). 

As a young boy, Ford showed mechanical talent with 
clocks and watches. Between high school and college he 
worked at the Crandall Typewriter Company, Groton, N.Y. 
(1894), at the Daugherty Typewriter Company, Kittanning, 
Pa. (1896-1898), and at the Westinghouse Electric and Manufacturing 
Company (1898). 

Figure 7. Hannibal C. Ford and his engineering staff about 1922. Ford is front and center; the others are unknown. (Photograph 
from the Sperry Gyroscope collection.) 

He studied mechanical engineering at Cornell University, 
graduating in 1903 as a “mechanical engineer in electrical 
engineering.” Evidently his classmates at Cornell respected 
his mechanical inventive ability, because his motto 
in their senior yearbook was, “I would construct a machine 
to do any old thing in any old way.” He was elected to 
membership in Sigma Xi, the honorary society for research. 

After graduation Ford worked for the J.G. White Company, 
New York (1903-1905), where he developed and held 
two basic patents issued in 1906 on the speed-control system 
long used in the New York subways. At the Smith-Premier 
Typewriter Company, Syracuse, N.Y. (1905-1909), he developed 
over 60 mechanisms of commercial importance and 
received a number of patents over the period 1908 to 1915. 

In 1909, Ford worked for Elmer A. Sperry, whom he had 
known as a young man in his home town, Sperry having been 
somewhat older. Ford assisted Sperry in the development 
of the gyrocompass, a mechanical device for determining 
own ship’s heading. The following year, Ford was promoted 
to be chief engineer of the newly formed Sperry Gyroscope 
Company, a position which he held until 1914. 

In 1915, Ford resigned from Sperry to organize his own 
company, the Ford Marine Appliance Corporation, which 
became the Ford Instrument Company in 1916 (see Figure 
7). The company’s mission was to develop and sell fire-con- 
trol systems to the US Navy. Its first product, Range Keeper 
Mark 1, was introduced into the US Navy in 1917 on the 
USS Texas. 

Ford’s Range Keeper Mark 1 (abbreviated Mk. 1) performed 
a remarkable number of continuous functions in real 
time for a computing system in those days: 


1. It generated range rate. 

2. By integration of range rate it determined present 
range. 

3. It generated the relative speed at right angles to the 
line of sight but not the present target bearing 
angle. 

The rates were obtained by resolving own ship’s and target’s 
speed vectors along, and perpendicular to, the present line 
of sight. These operations required mechanical resolvers, 
differential gears, and an integrator. 
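The three functions listed above can be sketched numerically. This is a minimal illustration under our own assumptions, not a reconstruction of the Mark 1: the function names, the units (yards and seconds), the convention that courses and bearing are measured clockwise from north, and the fixed-step summation are all hypothetical. The resolver step projects each ship’s speed onto the line of sight and across it; the differential step subtracts; the integrator step accumulates range.

```python
import math


def resolve_rates(own_speed, own_course_deg, tgt_speed, tgt_course_deg, bearing_deg):
    """Resolve both speed vectors along and across the line of sight.

    Returns (range_rate, cross_rate): the relative speed along the line
    of sight and the relative speed at right angles to it.
    """
    b = math.radians(bearing_deg)
    co = math.radians(own_course_deg)
    ct = math.radians(tgt_course_deg)
    # Resolver outputs for each ship, combined as a differential would.
    range_rate = tgt_speed * math.cos(ct - b) - own_speed * math.cos(co - b)
    cross_rate = tgt_speed * math.sin(ct - b) - own_speed * math.sin(co - b)
    return range_rate, cross_rate


def keep_range(initial_range, range_rate, elapsed, dt=1.0):
    """Integrate a constant range rate over elapsed time in steps of dt."""
    present_range = initial_range
    for _ in range(int(round(elapsed / dt))):
        present_range += range_rate * dt
    return present_range


# Own ship steams north at 10 yd/s toward a target dead ahead (bearing 0)
# that is moving north at 4 yd/s: the range closes at 6 yd/s.
range_rate, cross_rate = resolve_rates(10.0, 0.0, 4.0, 0.0, 0.0)
print(range_rate, cross_rate)                    # -6.0 0.0
print(keep_range(10000.0, range_rate, 60.0))     # 9640.0
```

As in item 3 of the list, this sketch produces the cross-line-of-sight speed but does not update the bearing; the bearing is treated as an input supplied from outside.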

Ford’s integrator (Figure 8) was of superior design for 
achieving high accuracy and long life. It used two stacked 
balls, held by stiff springs, between a disk and cylinder, each 
made of hard steel. The balls were held in place by pairs of 
small rollers in a carriage. This design permitted the carriage 
to move even when the disk was not moving, a feature that 
was necessary when integrating with respect to a variable 
other than time. The author does not know if Ford was 
aware of the prior art, such as James Thomson’s integrator 
and William Thomson’s (Lord Kelvin’s) computer concept, 
before applying for his patent. 

Own ship speed (measured from a pitometer log) and 
estimated target speed and course, own ship course (from a 
gyrocompass), as well as target bearing, were entered manually 
with the aid of dials, hand cranks, and knobs. The 
assembly of mechanisms was driven by an electric motor 
whose rotations represented the elapse of time. Present 
range, from the range finder, was telephoned to the plotting 
room, where the range keeper was kept. 

Meanwhile, Arthur H. Pollen, a British inventor, had 
devised a mechanism of the differential analyzer type (called 
an “Argo clock”) to solve, on a continuous real-time basis, 
the relative motion equations for own ship and a target ship: 
“It accounted in large part for the extraordinarily good 
shooting of several Russian battleships during World War 
I.” It was used also in the British Navy. Pollen’s invention 
must have preceded, by a short time, Ford’s range keeper. 

During World War I, the US Navy obtained the patent 
for the British Pollen fire-control computer system (Argo 
clock), and the Range Keeper Mark 1 was modified to 
incorporate one of Pollen’s concepts (dividing by the range 
and integrating with respect to time to get the bearing 
angle). By dividing relative motion across the line of sight 
by present range, the Ford range keeper (called appreciatively 
the “Baby Ford”) was able to generate the rate of 
change of target bearing and integrate it to get the target 
bearing angle, which in turn defined the line of sight. Thus 
the range and direction to the target could be generated and 
known, even if the target was lost from sight for a while. 
These modifications introduced another integrator and a 
divider into the evolving range keeper. 
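The added divider and second integrator can be illustrated in the same spirit. The sketch below is our own simplified model, not the Baby Ford’s actual equations: names, units (yards, seconds, radians internally), the clockwise-from-north angle convention, and the small fixed time step are all assumptions. At each step the cross-line-of-sight speed is divided by present range to give the bearing rate, and both range and bearing are then advanced.

```python
import math


def step(range_yd, bearing_rad, own, tgt, dt):
    """Advance generated range and bearing by one time step.

    own, tgt -- (speed, course_rad) for own ship and target.
    """
    so, co = own
    st, ct = tgt
    along = st * math.cos(ct - bearing_rad) - so * math.cos(co - bearing_rad)
    across = st * math.sin(ct - bearing_rad) - so * math.sin(co - bearing_rad)
    bearing_rate = across / range_yd          # the divider: rate = speed / range
    return range_yd + along * dt, bearing_rad + bearing_rate * dt


def generate(range_yd, bearing_deg, own, tgt, elapsed, dt=0.01):
    """Generate range (yd) and bearing (deg) after `elapsed` seconds."""
    bearing = math.radians(bearing_deg)
    for _ in range(int(round(elapsed / dt))):
        range_yd, bearing = step(range_yd, bearing, own, tgt, dt)
    return range_yd, math.degrees(bearing)


# Stationary own ship; target starts 10,000 yd due north and steams
# east at 10 yd/s for 100 s. Compare with the directly computed geometry.
own = (0.0, 0.0)
tgt = (10.0, math.radians(90.0))
gen_range, gen_bearing = generate(10000.0, 0.0, own, tgt, 100.0)
true_range = math.hypot(1000.0, 10000.0)
true_bearing = math.degrees(math.atan2(1000.0, 10000.0))
print(abs(gen_range - true_range) < 0.1, abs(gen_bearing - true_bearing) < 0.01)
```

Because range and bearing are both generated from the assumed speeds and courses, the solution continues to run when no new observations arrive, which is the property the text describes for a target temporarily lost from sight.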

Another of the early additions to the Baby Ford was a 
ballistic capability. It was to determine the time of flight of 
the shell to the predicted point of impact, the bearing of that 
point, and the range of that point. Then the gun angles could 
be calculated to implement that prediction. The guns were 
steered by hand (following pointers), but they were powered 
by Waterbury Speed Gears (hydraulic drives). 

Another capability was “rate control.” This function en- 
abled determining corrections to target speed and course as a 
result of data obtained from spotters aloft regarding the splash 
locations relative to the target. The Baby Ford had a rudimen- 
tary scheme for doing this, but it required the prediction calcu- 
lations to be stopped while rate control was being done. Han- 
nibal Ford earned a patent for his rate control scheme. 


By the end of World War I, the Ford range keepers 
provided a serviceable nucleus for a partially mechanized 
fire-control system. It was roughly comparable with the 
British system. The British gun directors were deemed bet- 
ter than those of the US Navy, but British range finders, 
having a smaller baseline, were inferior in accuracy. The 
Pollen Argo clock and Baby Ford were about a standoff. 
Acceptance of the Baby Ford was not universal and imme- 
diate. Some senior fleet officers tended to resist it, preferring 
the plotting boards, where they could “see” the situation at 
a glance. 

In addition to developing range keepers, Hannibal Ford 
almost single-handedly developed an entire gun director. It 
included an optical turret, a stable element to establish the 
vertical on a rolling and pitching ship, an angle gyro pointing 
at the target, and the associated Baby Ford range keeper, 
which included a ballistic computer. 


Naval fire control from 1930 to 1950 


In the 1920s the international clamor for disarmament 
forced the US Naval budget to a very low point. Although 
the situation improved in the 1930s, when the US Navy 
began again to grow, money was still tight. The Bureau of 
Ordnance was forced to drastically limit what it could procure. 
A striking example is offered by the deck tilt corrector 
that was, in the 1930s, ordered by the bureau to be developed 
by Ford Instrument Co. Unfortunately, there was only 
enough money to order half of the desired corrector. During 
part of that period Ford Instrument Co. was down to a 
three-day week for its employees. 

In the late 1920s, Hannibal Ford began developing the 
first antiaircraft (AA) fire-control system, including both a 
director (Mark 19) and a range keeper. Because of the 
target’s ability to maneuver at high speeds and angular rates 
as seen from own ship, the AA fire-control problem was 
intrinsically much more challenging than was fire control for 
a surface target. Despite the work on AA fire control, 
systems for surface fire control continued to pour from the 
Ford Instrument Co. under Ford’s technical direction. For 
example, the company developed the Range Keeper Mark 
8, which was used in the Marks 24, 31, 34, and 38 Gun 
Directors. Equations and a schematic diagram of informa- 
tion flow in the Range Keeper Mark 8 have been published 
in the open literature, although values of constants in the 
equations were not given. 

The period starting in 1930 saw the introduction of many 
improvements in fire-control systems. One was automation of 
data input into the computer. Friedman provides the following 
list of data entered manually in 1933 range keepers: 

Variable           Source 
Range              Phoned from range finder 
Own ship course    Gyrocompass repeater 
Own ship speed     Pitometer log 
Target course      Initial estimates for rate control 
Target speed       Initial estimates for rate control 
Target bearing     Automatically from director 
Spotting data      Spotter, by telephone 
Figure 8. Hannibal Ford’s integrator. (Photograph by Laurie Minor, Smithsonian Institution.) 


By the late 1930s the input of these variables was much more 
highly automated. 

The Gun Director Mark 33 was initiated in 1932 for 
dual-purpose 5-inch/38 guns on ships of all sizes. It resem- 
bled an apple on a stick when it was mounted aloft, and it 
had vibration problems. It was used with the Ford Range 
Keeper Mark 10 for antiaircraft fire, and it had a stable 
element and a computer below deck. A total of nearly 850 
Mark 33s was eventually installed. 

A typical World War II range keeper or computer con- 
sisted of three sections: 


1. Tracking section (the original range keeper functions 
dealing with relative and absolute motions of own 
ship and target). 

2. Prediction section (predicting range and time of flight, 
each from two moving time origins: the time of gun 
firing and the time of fuze time setting; and the re- 
quired gun angles found by considering the ballistic 
functions and wind). 

3. Correction section (calculating and applying corrections 
due to own ship angular motions, namely, roll 
and pitch, requiring trunnion tilt and deck tilt corrections 
to the gun angles). 


By the time of World War II most main battery fire 
control was done by Range Keepers Mark 8 in Directors 
Mark 34, mainly for cruisers, and Directors Mark 38, for 
cruisers and battleships. The Ford range keepers were 
superseded by the Ford Computer Mark 1 in the Gun 
Director Mark 37. This director was first tested in 1939 and 
it quickly became the standard dual-purpose director in 
World War II, although many Range Keepers Mark 10 in 
Directors Mark 33 also were built and used. The Bureau of 
Ordnance considered the Computer Mark 1 to be “enormously 
successful.” The system included transmission of 
data to and from the computer below decks by means of 
synchros. Designed originally for the 5-inch/38 guns, it was 
soon modified by Ford Instrument Co. for a number of other 
guns and ammunition types as well. 

Choice of the term “computer” in preference to “range 
keeper” recognized the growing inadequacy of the term 
“range keeper” to describe the system. Keeping range was 
a small part of its function. 

Fine as this fire-control equipment was for 4-inch guns 
and up, it was not suited to the smaller guns and decentral- 
ized control that proved necessary in World War II for 
defense against incoming aircraft in large numbers. More- 
over, the large fire-control systems were not economically 
feasible for use on small naval vessels and merchant ships 
having guns even as large as 3 inches. Fire control for 
close-in attack by a number of aircraft was “sadly neglected 
in the years between the two wars” due to an “ill-founded 
complacency” concerning the ability of fire-control systems 
of the day to destroy all targets at greater ranges. The 
Japanese exploited this weakness with several distinct 
modes of attack. 

Ford Instrument Co. was caught up in the rush by the 
Bureau of Ordnance to develop fire-control systems to 
meet these new needs. Ford developed, to various extents, 
the Gun Directors Mark 45, 48, and 49, all 
intended for close-in AA fire with small guns. The Mark 49 
used a gyro to determine lead angles based on the precession 
rates measured in tracking the target. It was ready by late 1942, 
and nearly 350 were eventually delivered. 

Figure 9. William H. Newell in 1988. 

Ford’s answer to the merchant ship problem was the 
Computer Mark 6, used with Gun Directors Mark 52 and 
53. Although only about the size of a large wheel of cheese, 
it ingeniously contained a simplified capability for solving 
the surface fire-control problem. 

In spite of all these developments with gyros, reticules, 
and lead computers, they only partly replaced the old open 
sight in World War II. Gunnery and fire-control system 
designers had prepared for a different enemy, one more 
like a towed target remaining at a distance of miles. 

Optical range finders gave way to radar in the late 1930s 
and early 1940s. This resulted in a substantial increase of 
capability of searching for targets (with “broad-beam search 
radars”) and tracking targets (with “narrow-beam fire-control
radar”). No longer was it necessary to illuminate a target
with star shells at night or lose a target in mist. Moreover,
the range, target bearing, and elevation signals were cleaner,
smoother, and more accurate. The measurement of range
and target direction angles had been freed from the limita- 
tions of the human operator of an optical range finder. The 
advancement of synchros for transmitting and receiving 
data in fire-control systems was a step away from manual 
follow-the-pointer systems. These synchro systems are
described in Department of Ordnance and Gunnery
publications.

A few problems existed because the Bureau of Ordnance 
had to deal with other bureaus in getting its equipment 
installed. For many years — until 1943, in fact — the gun 
mount foundations provided by the Bureau of Ships did not 
meet specifications of the Bureau of Ordnance. Presumably
the accuracy of gunnery then improved somewhat.

One of the most valuable advances was the development 
(about 1940) of powerful control systems for automatic 
training and elevating of guns of all sizes. After the installa- 
tion of automatic control, the guns could fire with precise 
aiming at any time, freeing gunnery from the centuries-long 
dependence on synchronizing firing with rolling of the ship. 
Although the earliest systems were susceptible to oscillations
and lags, improvements in the mathematical design of
control systems, and (according to William Hampton, then
a Ford employee) the use of steel piping for greater hydrau- 
lic stiffness, resulted in satisfactory performance. 

Another advance, the “proximity fuze,” made it possible 
to avoid having to set fuze time and incurring the associated 
errors of burst time. Projectiles could be loaded directly and 
fired immediately, and this allowed gunnery accuracy to 
improve even further. 

The entire functional environment of fire-control com- 
puters had to evolve to keep pace with the increased sophis- 
tication of the other components. 


Evolution on the system engineering 
level 


A respectably mature discipline of system engineering 
had developed in naval fire control by the late 1930s and, 
from that time on, the days of the inventor left to his own 
judgment were gone. 

One evidence of system engineering was the standard set 
of symbols that came to be used in equations to designate 
variables, such as Tf for time of flight and R2 for advance
range. Likewise, there was a standardized vocabulary of 
concepts such as “advance range” (the range at time of 
predicted impact) and “time of flight” (the time from firing
to impact). As more and more corrections were incorpo- 
rated in the range keepers, even the equations took an 
increasingly standard form which was then imposed by the 
Navy across all manufacturers. Some of these equations are 
given by Friedman and the 1941 US Navy Academy
book.

Another evidence of the use of system engineering is the 
top-down generation of specifications, beginning with the 
Bureau of Ordnance, with the manufacturers going into 
greater detail in the specifications. This procedure resulted 
in the systematic production of schematic diagrams, engineering
drawings, training manuals, and other documentation.

Another hallmark of system engineering was the analysis 
of system performance errors: For each Ford Instrument Co. 
product there was calculated a full complement of “class B 
errors.” These were the deviations of the system’s answers 
from theoretical answers calculated from the exact equa- 
tions for specified cases. Analysis of these errors led to 
knowledge of where more accurate calculations were 
needed in the product. The next step was to develop an 
“error budget” that allocated allowable errors among all
contributing categories in a hierarchy. The error budget 
pointed to novel developments needed as well as to limits 
on errors of conventional equipment. 
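
As a hedged illustration of the two ideas in this paragraph, the sketch below computes a class B error as the difference between a mechanism's answer and the exact answer, and rolls a hierarchical error budget up to a total. The root-sum-square combination rule, the category names, and every number are assumptions made for illustration; the source does not say how the errors were combined.

```python
import math

def class_b_error(mechanism_answer, exact_answer):
    """Deviation of the computer's answer from the exact-equation answer."""
    return mechanism_answer - exact_answer

def roll_up(budget):
    """Combine a nested error budget by root-sum-square (an assumed rule).

    A leaf is a number (an allowable error); a branch is a dict of sub-budgets.
    """
    if isinstance(budget, dict):
        return math.sqrt(sum(roll_up(child) ** 2 for child in budget.values()))
    return float(budget)

# Hypothetical allocation, in yards of range error.
budget = {
    "computer": {"integrators": 3.0, "multipliers": 4.0},
    "sensors": {"range_finder": 12.0},
}

computer_total = roll_up(budget["computer"])
system_total = roll_up(budget)
```

Walking the hierarchy this way makes it easy to see which branch dominates the total, which is the kind of question an error budget is meant to answer.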

Yet another aspect of system engineering was the analy- 
sis of errors of the enemy’s system, seeking weaknesses to 
exploit. By whatever means were used, the Japanese identified
opportunities for dive bombers, torpedo planes, toss
bombers, kamikazes, and so on. These tactical weapons 
presented the ships’ fire-control systems with short-range, 
high-range rate, and/or high bearing and elevation rates, 
where the accuracy of the Gun Directors Mark 33 and 37 
fell off sharply. That low performance is in contrast to the
reported high accuracy with slow targets, even at great ranges.
(For example, the battleship Washington is said to have achieved
nine hits on the Japanese battleship Kirishima, out of 75 rounds
of 16-inch shells at 19,000 yards range in the night battle of
Guadalcanal in 1942, where radar was used.)


The contributions of 
William H. Newell 


In 1926 the Ford Instrument
Co., which was then working on its 
first antiaircraft director, got a new 
employee: William H. Newell, aged 16. He worked first in 
the shop making high-precision mechanical computing com- 
ponents and, a year later, transferred to the Test Depart- 
ment where he acquired the techniques of making mechan- 
ical analog computers perform to their limits. In the 
evenings for seven years he went to the College of the City 
of New York to study engineering. He advanced rapidly as 
a result of his nearly unique talents as an inventor, designer, 
and developer of mechanisms and indeed, like Hannibal 
Ford, entire computing systems. In 1943, at age 32, he
became chief engineer. 




Newell’s inventions. Newell (see Figure 9) has received 
80 patents in connection with his work. The subject matter 
was long classified, so the public has not known of his 
contributions. Any attempt to determine Newell’s accom- 
plishments by concentrating on patent dates is difficult be- 
cause the date of filing for a patent might have been much 
earlier than the date of issue due to secrecy orders prevent- 
ing responsive issue. 

Among Newell’s mechanical, hydraulic, and electrical
inventions (see Appendix) were 31 devices of fundamental
importance to analog technology. Included are devices such
as a hydraulic computer; an irreversible drive involving
wedges to lock two disks if direction starts to reverse, as in
back torque from gun recoil; a torpedo director (Mark 2); a
director for defense against horizontal bombing runs; a 
scheme for using trains of balls, with wheels and steering 
rollers, to integrate complicated trigonometric functions 
and solve the fire-control tracking problem; and a comput- 
ing device for predicting the deck angles of an aircraft carrier 
at the instant an airplane would be landing. 

Table 1. Differences between differential analyzers and fire-control computers.

| Characteristic | Differential analyzers | Fire-control computers |
| --- | --- | --- |
| Application | Solution of arbitrary differential equation sets (general-purpose computer) | Computing continuous aiming and fuzing of naval guns |
| Environment | On “solid ground” in a building | In a moving warship experiencing severe shocks and vibrations |
| Construction | Originally spread out on a large breadboard for flexibility | Designed into minimum volume for shipboard use |
| Design style | Laboratory instrument design practice | Rugged, yet precise machine design |
| Problem size | Several differential and algebraic equations | Many differential and algebraic equations |

Many of these inventions concerned ways to deal with
inertia and friction loads on the driving mechanisms. They
were essentially servos, then usually called “follow-ups,”
that provided torque amplification while following a shaft
angular position signal. These servos had a differential gear
for comparing the output angle of the servo with the input
signal angle, producing an error angle, which determined
the signal to the drive to reduce the error. That differential
gear was represented on schematics by a cross in a circle, a
symbol which is still used on schematic diagrams for the
error-determining subtraction in control systems of many
types today.
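
The follow-up principle described above (subtract the output angle from the input angle, then drive the output so as to shrink that error) can be sketched in a few lines. This is a minimal discrete-time model with a purely proportional drive; the gain, the step count, and the function name are assumptions for illustration and do not represent any particular Ford Instrument mechanism.

```python
def follow_up(input_angles, gain=0.5, start=0.0):
    """Simulate a follow-up servo in discrete steps.

    At each step a "differential" forms the error angle (input minus output),
    and the drive moves the output by gain * error to reduce it.
    """
    output = start
    history = []
    for target in input_angles:
        error = target - output   # the cross-in-a-circle subtraction
        output += gain * error    # the drive acts to reduce the error
        history.append(output)
    return history

# Hypothetical command: the input shaft is held at 10 degrees for 20 steps.
trace = follow_up([10.0] * 20)
```

With a gain between 0 and 1 the output closes on the input without overshoot; a real servo also has inertia and lag, which is where the oscillations mentioned earlier in the article come from.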

The Ford Instrument equipment often used an “intermit- 
tent drive,” a device that enabled one part of the equipment 
to drive another over only a limited part of its total travel. 
Ford had designed the first intermittent drive, but Newell 
improved the design, putting the whole drive on one shaft. 


The significance of Newell’s work. One of the hallmarks
of Newell’s work has been that he took extra trouble to find 
the neat and simple way to do things, rather than go ahead 
with his first idea. A notable testimony to Newell’s and Ford 
Instrument’s skills was that Wernher von Braun selected 
them to build the mechanical and gyro guidance system for 
the first Redstone missile. Ford Instrument Co. also built
the guidance system for the Jupiter missile. 

Newell’s work was done with originality and self-reliance. 
One might wonder if he got ideas from other organizations in 
those days of technical ferment. However, Newell has denied 
that he got ideas from MIT’s differential analyzers or Servo 
Lab work. In fact, MIT bought Ford components, and Newell
believed that Ford Instrument was “ahead.” According to
Newell, Bell Telephone Laboratories, the Naval Research
Laboratory, the Office of Naval Research, the ENIAC project,
and the university researchers, including such avid communicators
as John von Neumann, Harold Hazen, Jay Forrester,
Claude Shannon, Norbert Wiener, Warren Weaver, and
Vannevar Bush, had no effect upon his work.

From 1965 to 1977, Newell worked for Perkin-Elmer, in 
Norwalk, Conn., on challenging projects such as the space
telescope, first on the senior technical staff and then as a 
consultant. But that is another story worth telling. 


Other mechanical analog computers 


At this point in the story, attention is turned from fire 
control to other specialized applications of mechanical ana- 
log computers. The author makes no attempt to describe the 
type generally known as a “differential analyzer” because it 
is already adequately described in other places, except to
distinguish it from the computers used in fire control.
Differential analyzers differed dramatically from fire-control
computers, as shown in Table 1.
These were the two distinct species that represented the
high point of mechanical analog computer development,
each in its own way. Williams felt that “the analog tradition
reached its height in the differential analyzers.” This author
holds that neither species was superior.


Torpedo mechanical computers. Torpedo data comput- 
ers for use by submarines were developed by the Arma 
Corporation in 1935. Arma had been building stable ele- 
ments and other gyroscope instrumentation for weapons 
since its founding in about 1920. The torpedo data computer 
automated much of the process of inserting data into a 
torpedo to establish its course, speed, and depth. It was 
primarily a mechanical computer with some electrical com- 
ponents. By World War II most submarines in the US Navy 
had a TDC Mark 3. A simpler and more compact version
of the torpedo data computer, the Mark 2, was developed 
by William Newell (see item 5 in the Appendix). 

Destroyers of that period carried Torpedo Director 
Mark 27, which contained a mechanical computer. A num- 
ber of approximations could be made, because the resulting 
errors could be ignored when torpedoes were fired in a 
spread. As a result, the equations were much less complex 
than those of the antiaircraft fire-control problem. As early
as 1942, the Bureau of Ordnance conceived of a need for a 
system for computing and displaying the data of concern in 
antisubmarine warfare. The resulting product was the Attack
Director Mark 2, which contained a mechanical computer.
Fifteen were delivered.

In the early 1950s, Arma built a mechanical analog computer
(“coordinate conversion computer”) containing a
gimbal system. Designed at MIT, it was one unit of a fire-control
system for use by the Navy in the Korean War. The
torpedo itself contained several small mechanical analog 
computers. They were extremely delicate and complex, with 
the result that their effectiveness was reduced. These com- 
puters included the following mechanical devices: 


1. The course control system that activated a rudder. 

2. A computer to determine the course angle for colli- 
sion with the target. 

3. A depth-control system, relying on a diaphragm to 
measure depth (water pressure) and a pendulum to 
measure rate of change of depth. The pendulum was 
later replaced by a gyroscope to avoid the error due 
to longitudinal acceleration. The change was Newell’s
idea. (See item 23 in the Appendix.)
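
The reason a pendulum misreads under longitudinal acceleration can be shown with a little arithmetic. The sketch below assumes an ideal pendulum that settles along the apparent vertical, so a forward acceleration a tilts it by atan(a/g) even when the torpedo is level. The function name and the sample acceleration are illustrative assumptions, not figures from the source.

```python
import math

G = 9.80665  # standard gravity, m/s^2

def pendulum_false_pitch(longitudinal_accel):
    """Apparent pitch angle (radians) reported by an ideal pendulum in a
    level body accelerating along its axis: atan(a / g)."""
    return math.atan2(longitudinal_accel, G)

# Hypothetical launch transient: 2 m/s^2 along the torpedo axis.
false_pitch_deg = math.degrees(pendulum_false_pitch(2.0))
```

Under these assumptions even a modest acceleration produces a false pitch of several degrees, which would be fed straight into the depth control.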


Bombsight mechanical analog computers. Another 
highly specialized type of mechanical analog computer was 
developed for use in bombers. Bombsights were remarkable 
for their extremely small size and high precision. The 
Norden bombsights contained over 2,000 parts. Development
began at the end of World War I and progress was
rapid: The Bombsight Mark 3 was contracted for in 1922,
the Mark 11 was accepted in 1931, and the Mark 15 was
being tested in 1931. Bombsights were also made by
Sperry.


One of the refinements to bombsights was the invention 
by Newell and Lawrence Brown that enabled a bomber to 
navigate by some identified visible point, when the target 
itself was obscured, and yet still bomb the target. 


Sights and directors for small guns. Major naval vessels 
had no small guns until after Pearl Harbor, when the large 
numbers of incoming aircraft had overwhelmed the fire-con- 
trol systems for large guns. As a result, a rapid evolution had 
to take place to provide something better than the open sight 
mounted on the gun barrel, which had been standard arma- 
ment against aircraft in World War I. 

A significant advance was made by the lead-computing 
sight developed in the 1930s by Charles S. Draper of MIT. 
Draper’s sight evolved from his earlier products of an air- 
craft instrument to display rates of turn and his tank gun 
sight. These devices used precessing-rate gyros mounted on 
the line of sight to the target. Each rate was multiplied by a
suitable factor to produce a proportional lead angle, which
was applied to the gun direction. The overall precision was
on the order of 2 percent. The Navy learned of the Draper 
sight belatedly: One was tested in July 1941, and the sights 
entered service in the fall of 1942 — built by Sperry and by 
Crosley. Eventually 85,000 of the Gun Sights Mark 14 were 
bought for naval vessels. 
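
The lead computation described above is a multiplication of each measured angular rate by a factor. The sketch below shows that arithmetic only; treating the factor as a projectile time of flight, and the target speed and range used in the example, are assumptions for illustration rather than details of the Draper sight.

```python
import math

def lead_angles(train_rate, elevation_rate, factor):
    """Lead angle in each axis = measured angular rate * factor.

    Rates are in radians per second and `factor` is in seconds, so the
    results are in radians. Reading the factor as a time of flight is an
    assumption.
    """
    return train_rate * factor, elevation_rate * factor

# Hypothetical target crossing at 150 m/s at 1,000 m range:
# angular rate = crossing speed / range.
train_rate = 150.0 / 1000.0  # rad/s
lead_train, lead_elev = lead_angles(train_rate, 0.0, factor=1.5)
lead_train_deg = math.degrees(lead_train)
```

Because the gyro measures the rate directly while the gunner tracks, no range or speed measurement is needed at the sight itself; only the factor has to be set.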

The US Navy’s response to the need also included the 
development of some heavy machine-gun directors. Con- 
tracts for development were awarded to Ford Instrument 
for the Gun Director Mark 45, to General Electric for the 
Mark 46, and to Arma for the Mark 47 (the Mark 46 and
47 never reached production). The Mark 45 was completed
as early as 1942; however, it was too complicated
and heavy as a computer, and it was too crowded as a
workplace, so production of it was stopped. It was replaced
by the Gun Director Mark 49, which also was
being developed by Ford Instrument. The Mark 49 contained
a gyro torqued hydraulically to precess it, and it
had hydraulic pick-offs. The Mark 49 was replaced by the
Mark 51. Located on a pedestal remote from the guns,
the Mark 51 used a Draper sight to transmit train and elevation
angle orders to heavy machine guns. It was manufactured
by Sperry Gyroscope Co., beginning in January 1942. Its
performance was poorest for surface targets, which had 
small angular rates as seen by the sight. 

Gun Director Mark 56 was designed at MIT. It utilized
an unusual mechanical analog computer technology: four-bar
linkages. By properly proportioning the bar lengths, one
could design linkages to generate a surprising variety of
functions. Some of the linkage computers were made by
Ford Instrument Co. Vannevar Bush, in his role as one of
the organizers of the National Defense Research Committee,
was able to do much for small-gun fire-control developments,
and he had a hand in its production.

In addition to the naval gun sights and directors mentioned
here for heavy machine guns, comparable or smaller
systems were developed for use in aircraft, such as the
largest bombers (B-29). There was, for example, a Mark 18
Turret Gun Sight, which had a computing mechanism. It was
followed by the Mark 23 in 1945.


Other analog mechanical computers. Flight simulators
for pilot training have been in existence since the “Pilot
Maker,” alias “Blue Box,” of Ed Link, developed in 1929.
Link’s flight simulator contained a pneumatic analog computer
that used principles he had learned in his father’s
organ factory. A mechanical analog flight simulator was
designed and built by Ford Instrument Co. in 1944. Later
flight simulators were based on electric and electronic analog
and then digital technology. Mechanical analog computers
were used also in early guidance systems for missiles:
Arma did the inertial guidance for the Atlas missile. William
Newell also invented a guidance system that worked without
gimbals, integrating components of acceleration and velocity
to determine present position (see item 27 in the Appendix).

The range of the German V2 rocket was determined by 
a mechanical analog computing device. It integrated accel- 
eration twice to get distance traveled; it also contained some 
linkages and differential gears to relate the twice-integrated 
acceleration to horizontal distance. As the technology was
refined, new applications were undertaken. Most of these 
and other mechanical analog computers were eventually 
superseded by electrical analog computers. 
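
The double integration that the V2 device performed mechanically can be imitated numerically. The sketch below integrates sampled acceleration twice with the trapezoidal rule, starting from rest; the constant-acceleration burn and the sampling interval are hypothetical values chosen so the result is easy to check by hand, and the code says nothing about how the actual mechanism was built.

```python
def integrate_twice(accelerations, dt):
    """Integrate sampled acceleration twice (trapezoidal rule).

    Returns (velocity, distance) at the end of the samples, starting from rest.
    """
    velocity = 0.0
    distance = 0.0
    for a_prev, a_next in zip(accelerations, accelerations[1:]):
        new_velocity = velocity + 0.5 * (a_prev + a_next) * dt
        distance += 0.5 * (velocity + new_velocity) * dt
        velocity = new_velocity
    return velocity, distance

# Hypothetical burn: constant 20 m/s^2 for 10 s, sampled every 0.1 s.
samples = [20.0] * 101
final_velocity, final_distance = integrate_twice(samples, dt=0.1)
```

For constant acceleration the answers should match v = a t and s = a t^2 / 2, which gives a quick sanity check on the integrator.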


The descendants of mechanical analog 
computers 


Mechanical analog computing evolved in two directions, 
branching into developments in AC analog computers and 
DC analog computers. 


AC analog developments. In about 1940 the market for 
tools for performing mathematical operations was quite 
small. Mechanical desk calculators served acceptably for all 
but the largest problems, such as fire control and exterior 
ballistics. When Thornton C. Fry wrote a survey article
about the extent of the use of mathematics in industry, he 
had little to report outside the telephone and aircraft indus- 
tries. One could not then imagine the explosion of electrical 
and electronic technologies that would result in a flood of 
computers available at modest cost. 

The principles of AC (alternating current) electrical an- 
alog circuits had been known since Steinmetz in the 1880s. 
Currents entering a node were known to add. The charge 
on a capacitor was known to be the time integral of the 
current that had flowed through it. It was known that a 
servo-driven potentiometer could be “tapped” to yield a 
function or a product of two variables. This technology was
not developed, however, until Bell Telephone Laboratories 
found application for it in a developmental gun director 
early in World War II. 
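
The three circuit principles named in this paragraph (currents add at a node, a capacitor integrates current, a potentiometer tap forms a product) map directly onto arithmetic. The sketch below is an idealized numerical model with no losses or loading; the component values and function names are assumptions for illustration.

```python
def node_sum(currents):
    """Currents entering a node add (Kirchhoff's current law)."""
    return sum(currents)

def capacitor_voltage(currents, dt, capacitance, v0=0.0):
    """Capacitor voltage v = v0 + (1/C) * integral of i dt.

    Uses a simple rectangular sum over equally spaced current samples.
    """
    charge = sum(i * dt for i in currents)
    return v0 + charge / capacitance

def potentiometer_tap(voltage, shaft_fraction):
    """Tap voltage = applied voltage * wiper position (0 to 1): a product
    of an electrical variable and a shaft position."""
    if not 0.0 <= shaft_fraction <= 1.0:
        raise ValueError("shaft_fraction must lie between 0 and 1")
    return voltage * shaft_fraction

# Hypothetical values: 1 mA and 2 mA summed into 1 microfarad for 1 ms.
summed = [node_sum([1e-3, 2e-3]) for _ in range(100)]
v = capacitor_voltage(summed, dt=1e-5, capacitance=1e-6)
tap = potentiometer_tap(v, 0.5)
```

Addition, integration, and multiplication are the same operations that differentials, disk integrators, and multipliers supplied in the mechanical computers, which is why the electrical version could replace them function for function.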

The BTL project was to develop an AC analog gun 
director, the T-15. It was funded in November 1941, and the 
model was completed a year later and tested in December 
1942. The T-15 was never put into production; it was,
however, used for research with targets flying trajectories
that were not straight lines.

The T-15 led to a proposal to the Navy, in February 1942,
to construct an AC analog version of the Ford Instrument
Company’s Computer Mark 1. A contract was awarded in
September 1942 for development of this “Mark 8 Computer.”
Although it proved to be faster than the Computer
Mark 1 in completing the initial transient of acquiring and
locking onto a target, the Mark 8 Computer was never
produced. It had one other feature worth noting: a special
electrical integrator that was developed for it.

Ford Instrument Co., under the direction of Harry Mc- 
Kenny and William Newell, developed an AC analog com- 
puter, the Mark 47, which replaced the mechanical analog 
Computer Mark 1. 

From 1945 to 1950 the Dynamic Analysis and Control 
Laboratory at MIT developed an AC analog computer, 
using 400-cycle AC components in a guided missile flight
simulator. This was an activity within Project Meteor. The 
flight table was mounted on four concentric gimbals so 
driven as to avoid gimbal lock under all conditions. 


DC analog developments. DC (direct current) amplifiers 
had been used since the post-World War I days of radio. 
They were highly developed in the 1930s by BTL, which 
used them for signal amplification in telephony. They were 
used also by George Philbrick at Foxboro, as early as 1937 
or 1938, for simulation of linear processes and control systems.
Developments of amplifiers for use in simulation
were made also by John Ragazzini et al. at Columbia Uni- 
versity in about 1940. Bell Telephone Laboratories devoted 
itself to the development of DC vacuum tube amplifiers for 
use in analog computers for fire control after about June 
1940. A patent, applied for in May 1941, was issued in June
1946 as US patent 2404387 to C.A. Lovell, D.B. Parkinson, 
and B.T. Weber. Their contemplated systems used summing 
networks, potentiometer cards for functions, and an integra- 
tor using an amplifier and a capacitor. 

In November 1940 Western Electric received a contract 
to develop a model of a DC analog gun director, the T-10. 
It was to use the BTL-developed DC analog technology. 
The model was tested successfully in December 1941.

The success of the T-10 led to a contract to build the 
production version, the M-9 Gun Director. It was delivered 
in December 1942, and it was placed in service in early 1943. 
It was used during the V-1 “buzz bomb” attack on London
to control the fire of 90-mm guns located along the English 
coast. During the month of August it shot down 90 percent 
of the buzz bombs that arrived, and in its best week it shot 
down 89 of the 91 that arrived. The M-9 (see Figures 10 and 
11) was aided by radar and proximity fuzes. A British
version of the M-9 (the T-24, directing 4.5-inch AA guns)
had its prototype completed by May 1942.

Another offspring of the M-9 was the “M-8 Gun Data
Computer,” which BTL developed for the US Coast Artillery
Board for control of 6- to 8-inch guns firing at surface
targets. The M-8 corrected for the parallax angles of different
guns firing at the same target and also corrected for the
earth’s curvature. It was never used in combat, because
there were no targets for it.

Lest it be gathered that all electronic analog developments 
in World War II were made by Bell Telephone Laboratories, 
note that the Arma Corporation developed, starting in the 
summer of 1940, an electronic analog antiaircraft computer for 
the Mark 47 Gun Director. It was to control 40-mm machine 
guns, but in 1941 it was changed to the 3-inch gun and was to 
be incorporated in the Mark 50 director. Deliveries of 43 units 
began in May 1943, but the computer had some serious diffi- 
culties: It weighed too much, and it was too complex for feasible 
mass production and for ease of maintenance. The system was 
further complicated by the fact that the electronic ballistic 
converter and fuze order computer had to control 40-mm,
1.1-inch, 3-inch/50, and 5-inch/38 guns.

The promise of BTL’s early electronic analog gun direc- 
tors encouraged other computer developments in World 
War II. One, the AN/APA-44, was a bombing and naviga- 
tion computer for aircraft. BTL also developed electronic 
analog flight simulators for pilot training for the PBM-3 
Martin Mariner patrol bomber, the Grumman Hellcat 
fighter, and the Consolidated Privateer patrol bomber.

After World War II, Project Cyclone was established to 
develop a DC analog computer for general-purpose appli- 
cations. The work was done by the Reeves Instrument 
Corporation. Very soon there were competitive commercial 
products available from Electronic Associates, Inc., Applied 
Dynamics, Inc., and eventually about 30 more companies. 


Figure 10. M-9 gun director in action. The tracking unit with its two operators is in the foreground, while the computing units 
are in the truck.


These “analog computers” became the tools of choice for a 
generation of control system designers, missile and aircraft 
designers, and analytical engineers in all branches of engineering
for purposes of dynamic and often real-time simulation.
These developments left the AC analog computers
far behind in accuracy and other performance features. One 
of the key steps was chopper-stabilization of the DC ampli- 
fiers, which otherwise had a maddening drift. 

One of the people who worked almost anonymously 
behind the scenes in this period was Perry Crawford at the 
Naval Special Devices Division. He had a hand in the ad- 
vanced thinking underlying Project Cyclone. He also had 
some influence upon the course of Project Whirlwind, an 
early digital computer developed at MIT which is best re- 
membered for its magnetic core memory by Jay Forrester. 
Crawford had written two provocative theses at MIT
which contributed to the frontier thinking of the time toward
electrical digital computers.


The defeat of mechanical analog 
computers 


The end for mechanical analog computers
as the computers of choice in fire-control systems began just
before World War II. They were then at their zenith. No
competition was in sight, yet the computers that would replace
them in less than a decade were already in development.

Mechanical analog computers for fire control were much in 
demand as a result of the rapid growth of the US Navy in those 
days. Accordingly, the Bureau of Ordnance was anxious that 
Ford Instrument Co. might not be able to manufacture them 
fast enough to meet the need. There were critical skills, machine
tools, and materials that were in short supply, any one of
which could have produced a fatal bottleneck. It was only
prudent that the Bureau of Ordnance then sought alternatives
on a second-source-of-supply basis.

The government’s expenditures for electrical and elec- 
tronic analog computers for fire control and aircraft simula- 
tion have been mentioned. This flow of money sufficed to 
fund the necessary research and development. The sudden- 
ness of the emergence of electrical and electronic analog 
computers is easily attributable to the equally sudden 
awareness of a need. 

It seems plausible that the lack of such funding and 
procurement desire in the previous years was responsible
for the relative stagnation of electrical and electronic ana- 
logs. This stagnation existed in spite of the almost-ready 
availability of virtually all of the required electrical and 
electronic analog components. One of the reasons for the 
stagnation is that the mechanical analog people believed
firmly that no electronic computer could survive the on- 
slaught of the shipboard shock and vibrations in battle upon 
vulnerable vacuum tubes and solder joints. Probably this 
thinking also kept electrical components, except the sturdy 
servos and synchros, out of mechanical analog computers. 

No one had realized the cost in battle due to the sluggish- 
ness of even the fastest mechanical computers in converging 
upon a target. This discovery was not made until speedier 
electrical analog competitors were developed and demon- 
strated. However, once discovered, this feature of the elec- 
trical analogs proved to be essential in dealing with a multi- 
plicity of very fast aircraft and missiles as targets. 

Another reason for the lack of effort to develop electrical 
analog computers until just before World War II was that 
the required parts (resistors, potentiometers, and capaci- 
tors) lacked sufficient precision for fire control. The neces- 
sary precision was, however, developed when the need 
materialized. 

During World War II the electrical analogs were on the
scene and were being rapidly developed with funds diverted 
from mechanical analogs. Moreover, with production came 
cost reductions for electrical analogs which could not be
matched by the precision mechanical computers. Similarly 
the size and weight of electrical analog computers came 
down rapidly to be more than competitive. The scales were 
tipping in favor of the electrical analogs. By the time they 
tipped all the way, it had been a sudden process over only a 
few years. The shift of contracts to electrical analog com- 
puter manufacturers and the general reduction in level of 
postwar spending crippled the manufacturers of mechanical 
analog computers. 

Mechanical analog technology died back but has not, 
even yet, died out. It is still in use where precise mechanical 
results are required, such as in very large telescopes, printing 
presses, and movable antennas. Mechanical analog technol- 
ogy survives also in many more subtle ways. For example, 
the “schematic diagrams” of mechanical analog computers 
evolved into “analog diagrams” for DC electronic analog 
computer problems or systems (in general- or special-pur- 
pose computers, respectively). Similar diagrams are often 
used in control engineering, digital computer simulation 
technology, and Forrester’s “system dynamics.” The present
trend toward massive parallelism in digital computers
also will continue the need for the analog type of diagram
well into the future.


Figure 11. M-9 gun director (covers off).


The short reign of electrical analog 
computers 


While the AC and DC analog computers were replacing 
mechanical analog computers, their own eventual successors,
the digital computers, were appearing and growing
in capability. Since that story is well documented in the 
Annals of the History of Computing, it is not repeated here. 
Suffice it to say that electrical and electronic analogs had a 
much shorter reign than mechanical analogs. From Ford’s 
Range Keeper Mark 1 to the virtual stoppage of production 
of mechanical analog computers in the 1950s there was a
reign of about 40 years. The electrical and electronic ana- 
logs, however, reigned supreme only about 10 years before 
they were surpassed and replaced by digital technology. 


A large measure of the historical importance of mechanical
analog computers stems from their service in naval
fire-control systems from World War I to somewhat beyond 
World War II. Much of the credit for US naval fire-control 
systems stems from the design and performance of the Ford 
Instrument Company’s mechanical analog computer prod- 
ucts, including developments from Range Keeper Mark 1 to 
Computer Mark 1. These computers were superbly accurate
despite their need to be rugged under the abuse of shocks
and vibration in battle.

The outstanding inventors and developers of the Ford 
Instrument computers were Hannibal C. Ford and William 
Newell. Their technical leadership, which spanned four de- 
cades, provided a unique corporate capability. 

Ford and Newell deserve to be recognized as mechanical 
geniuses at least on a par with Vannevar Bush. Bush has 
become the better known by far, because of his differential 
analyzers, because of his writings, and because of his visibil- 
ity as an administrator on the national level. In contrast, 
Ford and Newell worked exclusively on classified projects 
unknown to the public, modestly wrote nothing, and were 
administrators only within the company. They let their in- 
ventions and developments speak for them. 

It is unfortunate that the story of Ford and Newell has 
not been known and appreciated among engineers and the 
general public. The US Navy has had the facts all along, but 
it could not speak for many years because of the need for 
secrecy. The material could not be declassified until it no 
longer had current military importance. As a result, only 
those who were involved in the work have been privy to 
much of the story. 

Likewise, in the author’s opinion, mechanical analog
computers for naval fire control deserve a featured place in
the history of computing, as differential analyzers have
enjoyed.

The outlook for future mechanical analog technology is 
confined to some highly specialized opportunities where its 
advantages outweigh its disadvantages. These opportunities 
are most likely to arise for one or two components rather 
than complete computers. The glory lies in the past. 

Thus, the story of mechanical analog computers deserves 
a place in the history of computers. It is truly important in 
its own right and, in addition, the technology served as an 
early stepping stone toward today’s digital computers.

Appendix
Among Newell’s mechanical, hydraulic, and electrical
inventions were the following:

1. A hydraulic computer, plus some hydraulic components, such as a device to generate a hydraulic pressure proportional to a displacement, and a hydraulic torquer and pick-off for a gyro in a gun director. Patents 2317293, Apr. 20, 1943; 2405052, July 30, 1946; 2483980, Oct. 4, 1949; 2513888, July 4, 1950; 2533306, Dec. 12, 1950; 2550712, May 1, 1951; 2504571, Oct. 2, 1951; 2766587, Oct. 16, 1956.

2. Various rotary damping and/or inertia devices to be attached to a servo shaft to smooth the mechanical output with a low-pass filter. One of these, called a “k-motor,” acted only when the signal got rough. Patent 2400775.

3. Poitras and Tear of Ford Instrument developed an arrangement making a follow-up motor’s speed proportional to error, thereby obtaining an exponential characteristic, making it a “velocity-lag servo.” This used a drag cup and gave an error proportional to velocity. To eliminate this error there was introduced a differential gear between the motor and drag cup with an inertia on the other differential input, which gave a smaller error proportional to acceleration, but no error proportional to velocity. Newell, in one application, used an air dashpot to obtain the velocity-lag servo effect.

4. An irreversible drive involving wedges to lock two disks if direction starts to reverse, as in back torque from gun recoil. This device prevents stick-slip oscillation when driving an inertia, whereas an “irreversible” worm drive does not stop stick-slip. Patents 2266237, Dec. 16, 1941; 2402073, June 11, 1946.

5. A torpedo director (Mark 2). Newell simplified the mathematical basis, which enabled the size of the computer to be cut in half. Six of these systems saw service in World War I. Patent 2403542, July 9, 1946.

6. A director for defense against horizontal bombing runs. By restricting its applicability, Newell was able to do it with a much simpler computer than was in use. Patents 2403543, July 9, 1946; 2403544, July 9, 1946.

7. A combination of a coarse and fine synchro, using a cam-driven link to switch between coarse and fine. The patent application was filed in 1934, but the work had been done before that. Patent 2405045, July 30, 1946.

8. A single-ball integrator with a rack to eliminate tangent function effect. Patent 2412468, Dec. 10, 1946.

9. A scheme to prevent large inertial load on a hydraulic servo from overshooting, which involved introducing a spurious signal to start slowing it down before it reached the intended position. This was particularly important in synchronizing 5-inch guns and in bringing heavier guns to a loading position. Patents 2427154, Sept. 9, 1947; 2840992, July 1, 1958.

10. A triangle mechanism to generate the square root of the sum of the squares of two input position variables. Patent 2435818, Mar. 30, 1948.

11. A scheme for using trains of balls, with wheels and steering rollers, to integrate complicated trigonometric functions and solve the fire-control tracking problem (related to the earlier fundamental work of Maxwell, Ventosa, Hele-Shaw, and Smith). Patent 2528284, Oct. 31, 1950.

12. An electrical servo (with Henry F. McKenny). Patents 2448387, Aug. 31, 1948; 2546277, Mar. 21, 1951.

13. A printing press registration scheme using a photocell (with McKenny). Patent 2576529, Nov. 27, 1951.

14. An electronic analog resolver: given a magnitude R and an angle A, it computes the components R sin A and R cos A continuously while compensating for the magnetic distortion of the R input. Patent 2646218, July 21, 1953.

15. A “rate control” system whereby splash- or burst-point error data generated by a spotter topside would cause automatic continuous computation of corrections to target course and speed (patented in the name of Ford et al.). Ford had developed a rate control system that reversed the computation and found target course and speed, but in doing so interrupted the generation of the prediction problem. Newell used component integrators to generate corrections to the target course and speed from the spotting corrections without interrupting the continuity of the fire-control solution. Friedman gives the equations. Patent 2702667, Feb. 22, 1955.

16. A rhumb-line mechanical (later electrical) computer for Air Force navigation along a great circle from one given longitude/latitude to another. Thousands of them were built. Patent 2783942, Mar. 5, 1957.

17. An offset bombing director to allow homing on another point when the target cannot be seen (with Lawrence Brown). Patent 2615170, Dec. 3, 1957.

18. A mechanical integrator with reduced friction surface area. Patent 2693709, Nov. 9, 1954.

19. An “error reducer” unit for reducing greatly the pointing errors of main battery guns (developed in about 19M}). Patents 276358, Sept. 25, 1956; 2800769, July 30, 1957.

20. An electrical device containing tapped potentiometers for generating a class of functions of three variables. Patent 2817478, Dec. 24, 1957.

21. A computing device for predicting the deck angles of an aircraft carrier at the instant an airplane would be landing. Patents 2817479, Dec. 24, 1957; 2888195, May 26, 1959; 2888203, May 26, 1959; 2978177, Apr. 4, 1961; 2996706, Aug. 13, 1961; 3174030, Mar. 16, 1965.

22. A parachute-release device, with Howard Brevoort. Patent 2834083, May 13, 1958.

23. A device for squaring using a cone and cylinders (with S. Rappaport). Patent 2854854, Oct. 7, 1958.

24. A computing module for correcting for the tilt of gun trunnions. Patents 2902212, Sept. 1, 1959; 2920817, Jan. 12, 1960; 1967663, Jan. 10, 1961.

25. A depth control for torpedoes using a gyro to sense attitude. It avoided the error in the previous Uhlan gear design, which had been due to use of a pendulum for attitude sensing. During initial acceleration this gave a spurious attitude signal which caused a deep and many times disastrous dive. Patent 2920596, Jan. 12, 1960.

26. A torpedo motion simulator for engineering purposes based on the torpedo equations of motion, including the water mass and inertia associated with the torpedo. Such a simulator was built for development purposes at Ford Instrument Co., possibly the first torpedo simulator.

27. A “strapped-down” navigation system not using any gimbals (developed on a contract in 1958). In a personal communication, Newell said he considers this to be one of his potentially most important inventions. Patents 3049294, Aug. 14, 1952; 2087333, Apr. 30, 1963.

28. A scheme for developing an electric current from a hot rod and a magnetic field. This is the other invention that Newell considers to be potentially most important. Patents 3075096, Jan. 22, 1963; 3084267, Apr. 2, 1963.

29. Newell and Willard B. Constantinides developed a deck-tilt corrector which corrected gun angles approximately for the level and cross level angles of the deck.

30. The mechanical analog technology was extended in 1945 for the development of a bomber navigation trainer, mainly by Willard B. Constantinides of Ford Instrument Co. It solved the equations of motion of an airplane with far greater generality, realism, and precision than the contemporaneous pneumatic computers in the famous Link trainers, which dealt only with small linear perturbations about steady flight. To record the trajectory of the airplane as projected on the horizontal plane, the Ford simulator drove electrically and remotely a mechanical “crab” that drew a curve on a large sheet of paper on the floor.

31. A scheme for using resistors (standard but trimmed to precise values of a 1 {HM]- 1 range) to obtain amplifier input gains, which was patented.

In the foregoing list the items that were mainly electrical,
as distinguished from mechanical or hydraulic, were nos. 12,
13, 14, 16, 20, 29, and 31.


Many more people than have been mentioned played
notable roles under Ford and Newell. Certainly the following
at least also deserve to be named here: Ray Jahn, George
Crowther, George Hamilton, Charles Buckley, Walter C%inable
(the nephew of H.C. Ford), John Kallenberg, Howard
Brevoort, and Elmer Garrett. During World War II they
were assisted by Charles Henrich, Charles Pond, Kenneth
Crawford (brother of Perry), Rasmus Figenschou (of Norway),
John Hauser, George Licske, Mrs. George Elder (née
Athena Rosarkv), «4]01 Mertz, and the author and other,
then junior, design engineers.
Abstract


English version 


Over the past decades, computer simulation and computer-aided design have evolved into valuable
instruments that reach just about every branch of engineering in industry and academia. More
specifically, computational fluid dynamics (CFD) simulations make it possible to inspect flow phenomena in a
variety of applications. As simulation methods evolve, mature, and are adopted by a rising number
of users, demand increases for methods which not only predict the result of a specific configuration,
but can also indicate how to improve the design.

This thesis is concerned with the efficient calculation of sensitivity information of CFD al- 
gorithms, and their application to numerical optimization. The sensitivities are obtained by 
applying Algorithmic Differentiation (AD). 

A specific emphasis of this thesis is placed on the efficient application of adjoint methods, includ- 
ing parallelism, for commonly used CFD finite volume methods (FVM) and their implementation 
in the open source framework OpenFOAM. 


Deutsche Version 


In den vergangenen Jahrzehnten haben sich Computersimulation und Computer gestützte Design
Methoden zu einem wertvollen Instrument entwickelt, welches nahezu jeden Bereich der technisch-
und naturwissenschaftlichen Forschung und Wirtschaft beeinflusst. CFD Simulationen erlauben
es, Strömungsphänomene in einer Vielzahl von Anwendungen zu untersuchen. Je mehr numerische
Simulationsmethoden sich fortentwickeln und Anwendung finden, umso größer wird der Bedarf
an Methoden, die nicht nur das Ergebnis von spezifischen Konfigurationen vorhersagen können,
sondern auch Hinweise zur Optimierung des Designs geben können.

Diese Arbeit beschäftigt sich mit der effizienten Berechnung von Ableitungsinformationen auf
CFD Algorithmen und deren Anwendung zur numerischen Optimierung. Die Ableitungen werden
mittels Algorithmischen Differenzierens (AD) generiert.

Ein besonderes Augenmerk dieser Arbeit liegt auf der effizienten Anwendung adjungierter
Methoden, unter Berücksichtigung von Parallelismus, im Kontext von populären CFD Finite-
Volumen Algorithmen, und deren Implementierung im Open-Source Strömungslöser OpenFOAM.
1 Introduction


1.1 Outline 


The focus of this thesis lies on the efficient generation of sensitivity information for computational
fluid dynamics (CFD) applications, using the discrete adjoint approach. Methods and algorithms
are presented in a general CFD setting, and are implemented in a discrete adjoint OpenFOAM
framework. The application of these sensitivities to specific optimization problems is demonstrated
in several cases. The presented optimization cases are meant to demonstrate the flexibility of the
sensitivities obtained by the discrete adjoint framework, and do not necessarily represent the
state of the art in numerical optimization. A thorough discussion of state-of-the-art constrained
optimization techniques would be outside the scope of this thesis and remains an avenue for
future research.

This thesis is structured as follows. Chapter 1 introduces notations and conventions used in this
thesis. Related work and the most important contributions of this thesis are briefly summarized.
Furthermore, a high-level overview of numerical simulation and optimization is given. The
most important theoretical foundations, as well as the motivation for this thesis, are laid out
in Chapter 2. Basics of the CFD method, Algorithmic Differentiation (AD) and optimization
are covered.

Building on the foundations, the application of AD to CFD algorithms is discussed in detail in 
Chapter 3. Starting from a black-box approach as a proof of concept, a variety of improvements 
are presented, which increase the efficiency of AD in the context of iterative CFD solvers. Parallel 
adjoint communication is incorporated into the CFD solvers to retain the primal parallelism, 
requiring adaptations of the linear solvers. 

Using these results, Chapter 4 discusses different strategies implemented to more efficiently 
generate adjoints for steady state simulations. Furthermore, the generated adjoint sensitivities 
are verified against tangents and continuous adjoints. Alternative optimization methods, such as 
parametric optimization are discussed and implemented. 

Extended case studies for both topology optimization and shape sensitivities are presented 
in Chapter 5. A scaling study on current HPC hardware, the RWTH compute cluster, was 
performed. The results showcase the scalability of both the primal and adjoint implementations. 

This thesis closes with an overview of related work, carried out or supervised by the author,
as well as a summary and outlook. A brief developer documentation for the discrete adjoint
OpenFOAM implementation can be found in the appendices.

















1.2 Notation 


Here we introduce the most important notations and conventions, used throughout this thesis. 
Non-standard notations will be reintroduced once they first become relevant. 


Vectors: Bold letters, individual entries are numbered starting from zero, e.g. $\mathbf{v} = [v_0, v_1, v_2]$.
For spatial information, alphabetic indices $\mathbf{v} = [v_x, v_y, v_z]$ might be used alternatively.


Matrices: Upper case letters, individual entries are denoted with lower case letters and numbered
starting with zero, e.g.
$$A = \begin{pmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{pmatrix}.$$


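Scalar product: The scalar (inner) product between two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ is denoted by
$$\mathbf{x} \cdot \mathbf{y} = \mathbf{x}^T \mathbf{y} = \sum_{i=0}^{n-1} x_i y_i.$$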


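Outer product: The outer product between two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ is denoted by
$$\mathbf{x} \otimes \mathbf{y} = \mathbf{x} \mathbf{y}^T \in \mathbb{R}^{n \times n}.$$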


Functions: Lower case letters or in calligraphy, e.g. f(x) or J(y). 


Placeholder: $\bullet$ stands for a placeholder variable, to which super- or subscripts and modifiers can
be attached, e.g. $\bullet^{(1)}$.


Unit systems: If not otherwise specified SI units are used. 





Unit system for bytes: For the designation of bytes, we use the binary multiple system, differing
from the SI definition, also known as TiB, GiB, MiB, KiB:
$1\,\mathrm{TB} = 1024^1\,\mathrm{GB} = 1024^2\,\mathrm{MB} = 1024^3\,\mathrm{KB} = 1024^4\,\mathrm{bytes}$.


Spatial and temporal relations: In the context of FVM discretizations, spatial relations are
indicated by sub indices, e.g. $x_i$. Temporal (or pseudo temporal iteration) relations are
indicated by upper indices, e.g. $y^{i+1} = f(y^{i})$.


Powers: To avoid confusion with the temporal indices, if not immediately obvious by context,
we indicate powers by enclosing the target in brackets first, e.g. $(y)^2$.


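Gradient operators: In the context of FVM discretization, the gradient operator is used as a
spatial operator, ignoring temporal dimensions, if any:

Scalar Gradient:
$$\nabla p = \begin{pmatrix} \frac{\partial p}{\partial x} \\ \frac{\partial p}{\partial y} \\ \frac{\partial p}{\partial z} \end{pmatrix}$$

Vector Gradient:
$$\nabla \mathbf{u} = \nabla \otimes \mathbf{u} = \begin{pmatrix} \frac{\partial}{\partial x} \\ \frac{\partial}{\partial y} \\ \frac{\partial}{\partial z} \end{pmatrix} \begin{pmatrix} u_x & u_y & u_z \end{pmatrix} = \begin{pmatrix} \frac{\partial u_x}{\partial x} & \frac{\partial u_y}{\partial x} & \frac{\partial u_z}{\partial x} \\ \frac{\partial u_x}{\partial y} & \frac{\partial u_y}{\partial y} & \frac{\partial u_z}{\partial y} \\ \frac{\partial u_x}{\partial z} & \frac{\partial u_y}{\partial z} & \frac{\partial u_z}{\partial z} \end{pmatrix}$$

Outer product:
$$\mathbf{u} \otimes \nabla = \begin{pmatrix} u_x \\ u_y \\ u_z \end{pmatrix} \begin{pmatrix} \frac{\partial}{\partial x} & \frac{\partial}{\partial y} & \frac{\partial}{\partial z} \end{pmatrix} = \begin{pmatrix} \frac{\partial u_x}{\partial x} & \frac{\partial u_x}{\partial y} & \frac{\partial u_x}{\partial z} \\ \frac{\partial u_y}{\partial x} & \frac{\partial u_y}{\partial y} & \frac{\partial u_y}{\partial z} \\ \frac{\partial u_z}{\partial x} & \frac{\partial u_z}{\partial y} & \frac{\partial u_z}{\partial z} \end{pmatrix}$$

Divergence:
$$\nabla \cdot \mathbf{u} = \frac{\partial u_x}{\partial x} + \frac{\partial u_y}{\partial y} + \frac{\partial u_z}{\partial z}$$

Laplace Operator:
$$\Delta \mathbf{u} = \nabla^2 \mathbf{u} = \nabla \cdot (\nabla \mathbf{u}) = \frac{\partial^2 \mathbf{u}}{\partial x^2} + \frac{\partial^2 \mathbf{u}}{\partial y^2} + \frac{\partial^2 \mathbf{u}}{\partial z^2}$$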
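The index conventions of the gradient operators can be checked numerically. The following is a minimal sketch in plain Python, not part of the OpenFOAM framework; the names `vector_gradient`, `divergence`, `transpose`, and `example_field` are hypothetical and chosen only for illustration. It approximates the vector gradient by central differences, with rows indexed by the derivative direction and columns by the velocity component, so that its transpose corresponds to the outer product $\mathbf{u} \otimes \nabla$ and its trace to the divergence.

```python
def vector_gradient(u, x, h=1e-6):
    """Central-difference approximation of grad[i][j] = d u_j / d x_i."""
    n = len(x)
    grad = [[0.0] * n for _ in range(n)]
    for i in range(n):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        u_plus, u_minus = u(x_plus), u(x_minus)
        for j in range(n):
            grad[i][j] = (u_plus[j] - u_minus[j]) / (2.0 * h)
    return grad


def divergence(grad):
    """The divergence is the trace of the vector gradient."""
    return sum(grad[i][i] for i in range(len(grad)))


def transpose(matrix):
    """Transposing the vector gradient yields the outer product u (x) nabla."""
    return [list(row) for row in zip(*matrix)]


def example_field(p):
    """Illustrative velocity field u = [x*y, y*z, z*x^2] (an assumption)."""
    x, y, z = p
    return [x * y, y * z, z * x * x]


point = [1.0, 2.0, 3.0]
G = vector_gradient(example_field, point)
div_u = divergence(G)
outer = transpose(G)
```

For the assumed field, the analytic divergence at the chosen point is $y + z + x^2 = 6$, which the sketch reproduces up to round-off.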
1.3 Related Work & Contribution


In the field of CFD and numerical optimization, a wide range of prior knowledge exists. While
the foundations of fluid mechanics research range back several centuries [And16], the application
of computational methods to discretize and obtain solutions to the governing equations became
popular in the 20th century, with the advent of general purpose computing hardware [Sha04].
Compared to the history of CFD, the widespread application of adjoint optimization techniques
to CFD problems is more recent; however, first applications still date back to the 1970s (see
e.g. [Jam03; GP00] for a short overview of the history of adjoint methods in the context of CFD).
Most researchers apply either a continuous adjoint formulation, or a discrete adjoint formulation
on the residuals obtained from solutions of the FVM systems [Gil+03]. Algorithmic Differentiation
(AD) is concerned with the generation of derivatives of algorithms, given as computer programs
(for a historical discussion of AD see [GW08]). While the rules of AD can be applied by hand, for
complex codes a tool driven approach is highly desirable.
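To illustrate what differentiating an algorithm given as a computer program means, the following minimal sketch implements tangent (forward) mode AD by operator overloading in plain Python. It is an illustration under simplifying assumptions, not the AD tool used in this thesis; the class `Dual` and the functions `sin`, `f`, and `derivative` are hypothetical names introduced here.

```python
import math


class Dual:
    """A value paired with its directional derivative (tangent)."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (a * b)' = a' * b + a * b'
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

    __rmul__ = __mul__


def _lift(v):
    """Treat plain numbers as constants with zero tangent."""
    return v if isinstance(v, Dual) else Dual(float(v))


def sin(x):
    """Chain rule for the sine intrinsic."""
    return Dual(math.sin(x.value), math.cos(x.value) * x.tangent)


def f(x):
    """Example program to differentiate (an assumption for illustration)."""
    return x * sin(x) + 3.0 * x


def derivative(func, x0):
    """Seed the input tangent with 1 and read the output tangent."""
    return func(Dual(x0, 1.0)).tangent


x0 = 1.2
dfdx = derivative(f, x0)
```

The program `f` itself is left unchanged; only the type of its argument is replaced. This is the basic idea behind overloading-based AD tools, which in addition record the data flow to enable the adjoint mode.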

The application of AD to complex CFD tools is often limited to specific numerical kernels,
tailored to a predefined goal. A full differentiation of code frameworks is seldom achieved, be
it for lack of applicable tools or because of the extreme memory requirements imposed by the data flow
reversal of the adjoint mode. A recent application of an AD tool driven approach to a complex
CFD code is SU2 [Ecol18; ASG16]. Workflows based on source code transformation were
used in [Zho+18] and [MHM18].

This thesis pursues the approach to initially cover as much as possible of the used CFD 
framework by AD. The advantage of such an approach is twofold: First, an initial differentiated 
version of the problem can be obtained very rapidly, without much analytical insight into the 
underlying problem. Starting from there, possible optimizations can then be identified, applied, 
and evaluated. Second, a full AD implementation gives the flexibility to pursue a wide range of 
optimization tasks, without needing to adapt the underlying CFD framework to each of these 
applications. Starting from a naive black-box implementation of AD, treating the CFD algorithms
as a general computer program, different tactics are employed to improve performance and lower
the memory footprint. This is achieved by exploiting prior knowledge, blurring the lines between
fully discrete and continuous approaches. To the author’s knowledge, this work is the broadest
application of AD to the general purpose CFD tool OpenFOAM. Other adjoint solvers for
OpenFOAM exist, in the form of continuous adjoint solvers and implementations of the discrete
residual adjoint methods, possibly involving finite differences (FD) [OVW07; He+18].
OpenFOAM is particularly suited for this approach, due to its split architecture into a
general CFD framework (which was fully covered by AD), and individual solvers and utilities
based on the framework (which were covered by AD as needed). The changes required for the
introduction of AD into a complex CFD code base are discussed in detail in Section 3.2. Adjoint
communication patterns ([Sch14]) were implemented, such that the parallel convergence of the
primal is retained for adjoint calculations (Section 3.5), which is demonstrated on the RWTH
compute cluster (Section 5.1). The symbolic differentiation of linear solvers ([Gil08], Section 3.4)
required the careful inspection of the FVM discretization matrices involved, in order to correctly
implement the symbolic adjoints of the parallel matrix vector products (Section 3.5.5).

The application of AD to iterative solvers poses challenges in terms of memory consumption.
To overcome a memory bottleneck encountered, an existing adjoint vector compression technique
was implemented ([NL18], Section 3.8.2). This induced a larger than expected run time penalty.
To overcome this, a novel optimization to the adjoint vector compression technique was developed,
implemented, and studied (Section 3.8.3). This made it possible to regain most of the performance of the
unaltered adjoint vector implementation.

The developed discrete adjoint solvers are verified against adjoints obtained by the continuous 
adjoint method (Section 4.3). The implications of a frozen turbulence assumption versus a full 
differentiation of the turbulence models are studied. The availability of a fully differentiated 
Spalart-Allmaras turbulence model facilitates the calculation of shape sensitivities of an airfoil 
(Section 5.2). In addition to topological and shape sensitivities, a parametric optimization setting 
is presented (Section 4.6), demonstrating the differentiation of a solver directly coupled to a 
mesh generator. A similar parametric approach was chosen in [Aur+16], which, however, differentiates
an external CAD environment and mesher.

A novel connection between the FVM mesh formulation and the bipartite (partial) graph
coloring formulation was developed, which makes it possible to obtain compressed Jacobians effectively. The
method and resulting colorings are presented in Section 4.5.4. The obtained colors are used to
efficiently calculate the Jacobians of the FVM residuals using tangent or adjoint mode. A similar
approach is presented in [He+18], also implemented in OpenFOAM; however, in that work the
authors used different coloring algorithms and FD to determine the Jacobians.
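The idea of Jacobian compression by coloring can be sketched on a small example. The code below is an illustrative assumption and not the method of Section 4.5.4: it uses a simple greedy column coloring on a one-dimensional three-point stencil and central differences in place of tangent mode AD. All names (`greedy_column_coloring`, `residual`, `compressed_jacobian`) are hypothetical. Columns that never share a row receive the same color and can be perturbed together, so the number of residual evaluations depends on the number of colors instead of the number of columns.

```python
def greedy_column_coloring(pattern, n_cols):
    """pattern[i] is the set of column indices with nonzeros in row i."""
    # Two columns conflict if they appear together in at least one row.
    conflicts = [set() for _ in range(n_cols)]
    for cols in pattern:
        for j in cols:
            conflicts[j].update(cols - {j})
    colors = [-1] * n_cols
    for j in range(n_cols):
        used = {colors[k] for k in conflicts[j] if colors[k] >= 0}
        c = 0
        while c in used:
            c += 1
        colors[j] = c
    return colors


def residual(x):
    """Illustrative three-point stencil: r_i = x_i^2 + x_{i-1} * x_{i+1}."""
    n = len(x)
    r = []
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        r.append(x[i] * x[i] + left * right)
    return r


def compressed_jacobian(func, x, pattern, colors, h=1e-4):
    """One central-difference directional derivative per color."""
    n = len(x)
    n_colors = max(colors) + 1
    jac = [dict() for _ in pattern]
    for c in range(n_colors):
        seed = [1.0 if colors[j] == c else 0.0 for j in range(n)]
        x_plus = [xi + h * s for xi, s in zip(x, seed)]
        x_minus = [xi - h * s for xi, s in zip(x, seed)]
        r_plus, r_minus = func(x_plus), func(x_minus)
        # Each row contains at most one column of color c, so the
        # directional derivative can be attributed to that column.
        for i, cols in enumerate(pattern):
            for j in cols:
                if colors[j] == c:
                    jac[i][j] = (r_plus[i] - r_minus[i]) / (2.0 * h)
    return jac, n_colors


n = 8
pattern = [{j for j in (i - 1, i, i + 1) if 0 <= j < n} for i in range(n)]
colors = greedy_column_coloring(pattern, n)
x0 = [float(i + 1) for i in range(n)]
jac, n_colors = compressed_jacobian(residual, x0, pattern, colors)
```

For this stencil three colors suffice regardless of `n`, so the full Jacobian is recovered from three directional derivatives.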

While many concepts are presented, such that they can be conveniently implemented in 
OpenFOAM, the methods developed and used in this thesis are applicable to a wide variety of 
CFD algorithms and optimization tasks. 

Previous publications of the author, related to the topics of this thesis, include [TN13; TSN15],
and [TN18]. Parts of Sections 3.2 and 3.3 (black-box differentiation of OpenFOAM and checkpointing)
were discussed in [TN13]. Parts of Section 3.5 (parallel black-box adjoints using adjoint MPI)
were discussed in [TSN15]. Furthermore, parts of Sections 3.4, 3.5 (symbolically differentiated
linear solvers, embedded into SIMPLE algorithm, involving parallelism), and 5.1 (HPC study of
3D Pitz-Daily case) were previously presented in [TN18].








Prior Publications 


[TN13] M. Towara and U. Naumann. “A Discrete Adjoint Model for OpenFOAM”. In: Procedia
Computer Science 18.0 (2013). 2013 International Conference on Computational
Science, pp. 429-438.


[TN18] M. Towara and U. Naumann. “SIMPLE Adjoint Message Passing”. In: Optimization
Methods and Software (2018), pp. 1-18.


[TSN15] M. Towara, M. Schanen, and U. Naumann. “MPI-Parallel Discrete Adjoint OpenFOAM”.
In: Procedia Computer Science 51 (2015). 2015 International Conference
on Computational Science, pp. 19-28.
Figure 1.1: The V-model of product development, also known as Vee-diagram. Modeled
after [Osb+05]. See also VDI 2206 [GM03].


1.4 Numerical Simulation 


1.4.1 Introduction & Motivation 


Numerical simulations of physical problems have become an integral part of many modern
design processes [Ott+03]. Simulation is able to augment classical design verification processes,
relying on physical experiments, with numerical experiments. The repeatability of numerical
experiments allows the efficient exploration of the design space for many different configurations
and boundary conditions. Virtual product development makes it possible to move the test and evaluation
of preliminary designs to an earlier point in the development cycle, avoiding costs like expensive
tooling for pre-production models. Model candidates can be evaluated in a shorter time period,
so that the stages of development can be iterated at a faster pace. With correctly chosen
models, numerical simulations can provide observations for conditions which might not be
readily reproducible or observable in a lab environment, such as:


• reduced or zero gravity [CS93],
• extreme temperatures or pressures [Anc+97],
• extreme length scales, e.g. quantum level [HTK12] or astronomical scales [Aba+03],
• short (e.g. microseconds) [Hes+08] or long (e.g. centuries) [Cro00] time scales,
• hazardous or restricted processes, e.g. nuclear explosions [KGG18],
• or probabilistic quantum level effects [LW90].


Industry fields which heavily rely on numerical simulation to reduce product development times
include the automotive [Tho98], aerospace [SV16], and defense [NW11] industries. Statistical
simulation approaches are prevalent in the financial sector [GG06].

Product development, be it physical (mechanical) or virtual (software), is usually an iterative
process [TN86]. Requirements are identified, incorporated into a design, and implemented. The
implementation is tested and integrated into the systems context (e.g. a part into a car or
a module into a software framework). During testing and integration, more likely than not,
issues are identified that feed back into the requirements, starting off another iteration of the
development process. This iterative design process is often illustrated with the V-model of product
development. Many different variants of this model exist; one of them is shown in Figure 1.1.

The incorporation of numerical simulations into product development is often called computer- 
aided engineering (CAE), in distinction, and as a supplement to, the more classical computer-aided 
design (CAD). Numerical simulation can help to reduce the time spent in each iteration of the 
product development cycle. 

In civil engineering, the finite element method (FEM) [Hug12] is the dominant method of
predicting the stresses in materials and the resulting deformations. In fluid mechanics, the finite
volume method (FVM) [Pat80] is widely popular, but alternative methods, including FEM and
Galerkin [CKS00] methods, are available.

Multiphysical simulations aim to bridge the divide between different fields of science, e.g. by
coupling fluid simulation with structural analysis (fluid structure interaction, FSI) [Zim06], or by
coupling fluid flow and heat transfer (conjugate heat transfer, CHT).





1.4.2 Computational Fluid Dynamics 


Computational fluid dynamics (CFD) is the application of computational methods to the field of
fluid mechanics. Early applications date back to the beginning of the 20th century, when solutions
were obtained by hand or with mechanical computers [Hun98]; however, widespread application has
only been realized with the introduction of general purpose computing devices [Wen08]. Today
CFD methods are an important application for current high performance computing (HPC)
resources. With the ever increasing performance of current HPC infrastructure, more complex
simulations (multi-scale, multi-physics) and solution methods, such as direct numerical simulation
(DNS) [Ors70], have become feasible. This thesis focuses on applications of the FVM
(introduced in the next chapter) in the context of CFD applications. However, other discretization
methods are available, such as finite element methods (FEM) or probabilistic methods. The most
common application of CFD is the solution of the Navier-Stokes partial differential equations,
which make it possible to predict the flow of viscous fluids. For fluids that exhibit a negligible amount of
viscosity, the less complex Euler equations can be solved instead. This is particularly useful for
compressible flows in the transonic regime.

In the context of FVM CFD simulations, a wide variety of commercial, academic, and open-source
software codes exist. OpenFOAM (Open Source Field Operation and Manipulation) is an
open-source software suite, which has a strong user base in both industry and academia, due to
its versatility, good parallel scalability, and lack of licensing costs. Development of OpenFOAM
started in the early 90s as a research project at Imperial College London [Wel+98; Jas96]. It was
first released to the public as a commercial code, called FOAM. Later the code was released as
open-source under the GNU GPL-v3 [GPL] in 2004, forming OpenFOAM. Development remains
very active, although it is currently fragmented into three different forks. The rights to
the OpenFOAM trademark currently belong to ESI Group, which develops and distributes the
OpenFOAM-plus fork.

In this thesis the applicability of the discrete adjoint mode of AD to the CFD design optimiza- 
tion process is explored. As a demonstrator for these techniques a discrete adjoint version of 
OpenFOAM-plus, developed by the author, is used. 







For numerical optimization, gradient based methods are popular, as they offer better convergence
compared to gradient free methods and thus require fewer evaluations of the underlying simulation models.
The efficient computation of derivatives is its own research field, with different strategies for
obtaining the derivatives. This will be discussed in detail in later sections.


1.4.3 Topology Optimization 


As an example for numerical optimization in the context of CFD simulation, we will use the 
topology optimization technique throughout this thesis. 

The concept of topology optimization was first introduced in FEM analysis of mechanical
stresses [Ben89]. Topology optimization aims to find an optimal structure, relative to some cost
function, within a given design space and boundary conditions. The difference between topology 
optimization and a classical shape optimization is that the topology optimization does not start 
from and adapt an initial design, which for a non-global optimization might introduce a strong 
bias toward a specific design, but is allowed to explore the full design domain. For example, a 
truss structure might be optimized to be as light as possible, while still withstanding a set of 
load conditions. 

Topology optimized designs often appear rather organic in shape and, if optimized without
design constraints, might be hard to manufacture. However, with the increasing sophistication of
additive manufacturing methods [Nin+15] and advanced CAD systems, highly optimized complex
structures are increasingly common, e.g. in the aircraft industry [Emm+11]. An example of
such a structure, produced by additive manufacturing, is given in Figure 1.2. Topology
optimization methods generally use many optimization parameters, therefore efficient calculation
methods are needed. Ideally, the computational complexity should be independent of the number
of parameters. This motivates the usage of adjoint methods, introduced in later sections.
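
The cost argument can be made concrete with a small experiment. The following Python sketch is only an illustration under stated assumptions: the toy objective, the hand-written reverse sweep, and all function names are hypothetical stand-ins and are not taken from OpenFOAM or dco/c++. It compares a forward finite difference gradient, which needs one additional objective evaluation per parameter, with a single reverse (adjoint) sweep.

```python
import math


def objective(x):
    """Toy cost function J(x) = sum_i sin(x_i) * x_{i+1} (hypothetical stand-in)."""
    return sum(math.sin(x[i]) * x[i + 1] for i in range(len(x) - 1))


def gradient_finite_differences(x, h=1e-6):
    """Forward finite differences: len(x) + 1 objective evaluations."""
    evaluations = 1
    j0 = objective(x)
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += h
        grad.append((objective(xp) - j0) / h)
        evaluations += 1
    return grad, evaluations


def gradient_adjoint(x):
    """Hand-written reverse sweep: one pass over the terms of the objective."""
    grad = [0.0] * len(x)
    for i in range(len(x) - 1):
        # The term sin(x_i) * x_{i+1} contributes to two gradient entries.
        grad[i] += math.cos(x[i]) * x[i + 1]
        grad[i + 1] += math.sin(x[i])
    return grad


x = [0.1 * k for k in range(1, 51)]  # 50 design parameters
g_fd, n_evals = gradient_finite_differences(x)
g_adj = gradient_adjoint(x)
max_diff = max(abs(a - b) for a, b in zip(g_fd, g_adj))
```

Both gradients agree up to the finite difference truncation error, while the number of objective evaluations of the finite difference variant grows linearly with the number of parameters.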





Figure 1.2: Structural part of an Airbus A350, optimized for weight. Part sintered from metal
by additive manufacturing. Source: Airbus [Air18].

In the field of structural FEM, different flavors of topology optimization have emerged, such as
level set methods [SM13].

In contrast to the history of applications in FEM, the application of topology optimization
to CFD problems is comparatively recent [BP03]. Topology optimization immerses a geometry
into the available design domain by selectively blocking the flow in cells not located inside
the geometry, e.g. by some penalization. A big advantage of CFD topology optimization is that
the same mesh representation can be used for all optimization stages, eliminating the need for
expensive remeshing or mesh morphing. If needed, the domain can still be refined near the
topology boundaries. A disadvantage of the naive implementation is that the immersed geometry
is not separated from the rest of the design space by real walls, to which wall boundary conditions
could be easily applied. This complicates the application of wall boundary conditions (e.g. heat
transfer) and the evaluation of turbulent wall functions. Furthermore, it reduces the accuracy
of the solution. If higher physical accuracy near the walls is required, the immersed boundary
method [Pes02] can be combined with topology optimization [Mit+08].

Topology optimization for ducted flows was introduced to OpenFOAM using continuous
adjoints [OVW07; Oth08]. The introduction of penalty terms to the Navier-Stokes equations, as
well as the derivation of the continuous adjoint equations, are discussed in detail in Section 2.5.








2 Foundations 


This chapter lays the foundations needed to comprehend the later chapters. A brief
introduction to (computational) fluid dynamics is given, before the discretization
with FVM is introduced. Algorithmic differentiation and general optimization methods are then
discussed. A brief introduction to the AD tool dco/c++ is given, which will be used later on
to illustrate abstract concepts with code examples. Finally, the discrete adjoint residual approach
is introduced. Additionally, graph coloring techniques are presented, which will later be applied
to evaluate the discrete adjoint residuals more efficiently.








2.1 Navier-Stokes Equations 





The Navier-Stokes [Nav23; Sto51] equations are the most prevalent equations in CFD. They
describe the conservation of momentum and mass for viscous fluids. In the following, the basics of
general conservation laws are presented. They are subsequently used to derive the Navier-Stokes
equations for incompressible Newtonian flow.








2.1.1 Derivation from Conservation Laws 


The fundamental laws of physics dictate that certain physical quantities in a system remain
constant as the system evolves in time. Important conservation laws include the conservation of
energy, momentum, and mass.


Reynolds Transport Theorem 


In the context of fluid flow, the conservation laws specify the conservation of physical quantities over
moving material volumes $M$, which travel along with the fluid. However, for practical computation it is
desirable to express these laws for control volumes $V$. Mass can travel over the system boundaries
through inlets and outlets. An observed mass volume thus leaves the system after a relatively short
period of time, making material volumes inconvenient. The Reynolds transport theorem [RBM03]
provides the required connection between a material volume $M$ and corresponding control volumes
$V$.

Let $\Psi$ be a quantifiable attribute of a flow (e.g. mass, momentum, energy) and let
$\psi = \mathrm{d}\Psi/\mathrm{d}m$ be the intensive value of $\Psi$ (amount of $\Psi$ per unit mass $m$), that is

$$\Psi = \int_M \psi \,\mathrm{d}m .$$

For a material volume $M$, the total change of quantity $\Psi$ is determined by the change of $\Psi$ in $V$
plus the net flow of $\Psi$ into and out of the control volume through its control surface $S$. Let $\rho$ be
the density of the fluid, $\mathbf{u} \in \mathbb{R}^3$ the velocity and $\mathbf{n} \in \mathbb{R}^3$ the outward facing normal to the control
surface $S$. Then the Reynolds transport theorem states that

$$\frac{\mathrm{d}}{\mathrm{d}t}\int_M \rho\psi \,\mathrm{d}V = \int_V \frac{\partial}{\partial t}(\rho\psi)\,\mathrm{d}V + \int_S \rho\psi\,\mathbf{u}\cdot\mathbf{n}\,\mathrm{d}S,$$

or, by transforming the surface integral to a volume integral by applying the divergence theorem,

$$\frac{\mathrm{d}}{\mathrm{d}t}\int_M \rho\psi \,\mathrm{d}V = \int_V \left(\frac{\partial}{\partial t}(\rho\psi) + \nabla\cdot(\rho\mathbf{u}\psi)\right)\mathrm{d}V. \tag{2.1}$$


Derivation of Mass Conservation for Incompressible Flow 





The principle of conservation of mass states that, without the presence of mass sources and sinks,
the mass of fluid in a region $M$ will be conserved, that is, it does not change over time:

$$\frac{\mathrm{d}}{\mathrm{d}t}\int_M \rho \,\mathrm{d}V = 0.$$

The mass conservation for a fixed control volume $V$ can be derived from Equation (2.1). Choosing
$\Psi = m$, the corresponding intensive quantity is $\psi = 1$ and

$$\int_V \left(\frac{\partial\rho}{\partial t} + \nabla\cdot(\rho\mathbf{u})\right)\mathrm{d}V = 0.$$

For this equation to hold for any control volume $V$, the integrand has to vanish at each point
$\mathbf{x} \in \Omega \subset \mathbb{R}^3$:

$$\frac{\partial\rho}{\partial t} + \nabla\cdot(\rho\mathbf{u}) = 0. \tag{2.2}$$

For incompressible flows, the density is assumed to be constant throughout the domain in both
space and time ($\mathrm{d}\rho(\mathbf{x},t)/\mathrm{d}t = 0$). Thus, Equation (2.2) can be simplified to the differential form

$$\nabla\cdot\mathbf{u} = 0.$$

This equation is called the mass conservation or mass continuity equation. It follows directly
from the equation that the velocity field of an incompressible fluid is divergence free.
Applying the divergence theorem gives the integral form

$$\int_S \mathbf{u}\cdot\mathbf{n}\,\mathrm{d}S = 0,$$

which makes it obvious that all flow (mass) that enters a system must leave it again at some
other point.
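
Both forms of the continuity equation can be checked numerically for a given velocity field. The following Python sketch is an illustration only: the analytic velocity field, the finite difference step, and the function names are assumptions chosen for the example. It evaluates the divergence at a point with central differences and the net outflow over the boundary of the unit square with a midpoint rule.

```python
import math


def velocity(x, y):
    """2D field u = (sin x cos y, -cos x sin y), analytically divergence free."""
    return math.sin(x) * math.cos(y), -math.cos(x) * math.sin(y)


def divergence(field, x, y, h=1e-5):
    """Central-difference approximation of du/dx + dv/dy at the point (x, y)."""
    du_dx = (field(x + h, y)[0] - field(x - h, y)[0]) / (2 * h)
    dv_dy = (field(x, y + h)[1] - field(x, y - h)[1]) / (2 * h)
    return du_dx + dv_dy


def net_outflow_unit_square(field, n=200):
    """Midpoint-rule approximation of the integral of u.n over the boundary of [0,1]^2."""
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        s = (k + 0.5) * h
        total += field(1.0, s)[0] * h  # east face, n = (+1, 0)
        total -= field(0.0, s)[0] * h  # west face, n = (-1, 0)
        total += field(s, 1.0)[1] * h  # north face, n = (0, +1)
        total -= field(s, 0.0)[1] * h  # south face, n = (0, -1)
    return total


div_at_point = divergence(velocity, 0.3, 0.7)
outflow = net_outflow_unit_square(velocity)
```

For this field, whatever enters the square through one pair of faces leaves it through the other, so both quantities vanish up to rounding and discretization error.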








Derivation of Momentum Conservation for Incompressible Flow of Newtonian Fluids 


Newton's second law states:

$$\frac{\mathrm{d}(m\mathbf{u})}{\mathrm{d}t} = \sum \mathbf{f},$$

where $\mathbf{u}$ is the velocity of an object, $m$ is the mass, and $\mathbf{f}$ are forces acting on the object. From
this, the momentum conservation equation can be derived by considering the velocity $\mathbf{u}$ as the
conservation variable $\psi$ in Equation (2.1):

$$\frac{\partial}{\partial t}\int_V \rho\mathbf{u}\,\mathrm{d}V + \int_S \rho\mathbf{u}\,(\mathbf{u}\cdot\mathbf{n})\,\mathrm{d}S = \sum \mathbf{f},$$

where $\mathbf{f}$ are the external and internal (viscous) forces acting on the system, replacing the material
volume integration. The forces on the right hand side will now be expressed with intensive
quantities, by making the assumption of Newtonian fluids. For Newtonian fluids, it is assumed
that the shear stress $\tau$ inside the fluid can be expressed by the shear velocity $\mathrm{d}u/\mathrm{d}n_y$ as

$$\tau = \mu\,\frac{\mathrm{d}u}{\mathrm{d}n_y},$$

where $n_y$ is the direction perpendicular to the flow and $\mu$ is the (dynamic) viscosity.

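From this assumption the following relations for the stress tensor $\mathbf{T}$ and the deformation
tensor $\mathbf{D}$ can be derived (see e.g. [BW97]). They only include intensive quantities and introduce
the pressure $p$ into the momentum equations.

$$\mathbf{T} = -\left(p + \tfrac{2}{3}\,\mu\,\nabla\cdot\mathbf{u}\right)\mathbf{I} + 2\mu\mathbf{D},$$

$$\mathbf{D} = \tfrac{1}{2}\left(\nabla\mathbf{u} + (\nabla\mathbf{u})^{T}\right).$$

With those tensors, the acting forces can be split into internal viscous forces and external forces

$$\frac{\mathrm{d}}{\mathrm{d}t}\int_V \rho\mathbf{u}\,\mathrm{d}V + \int_S \rho\mathbf{u}\,(\mathbf{u}\cdot\mathbf{n})\,\mathrm{d}S = \int_S \mathbf{T}\cdot\mathbf{n}\,\mathrm{d}S + \int_V \rho\mathbf{b}\,\mathrm{d}V,$$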


where $\mathbf{b}$ are the external body forces per unit mass. By introducing an infinitesimally small
control volume $V$, one obtains the differential form of the momentum conservation equations

$$\frac{\partial(\rho\mathbf{u})}{\partial t} + \nabla\cdot(\rho\mathbf{u}\mathbf{u}) = \nabla\cdot\mathbf{T} + \rho\mathbf{b}.$$

For isothermal and subsonic flows, the assumptions of constant fluid density and viscosity are
usually valid. With those two assumptions, the derivatives of $\rho$ and $\mu$ vanish in both space and
time, which allows the equations to be simplified to

$$\frac{\partial\mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u} = \nu\nabla^2\mathbf{u} - \frac{1}{\rho}\nabla p + \mathbf{b}, \tag{2.3}$$

where $\nu = \mu/\rho$ is the kinematic viscosity.


Navier-Stokes Equations for Steady Incompressible Flow 


Combining the conservation of momentum and conservation of mass into one system of partial
differential equations, the governing Navier-Stokes equations for steady ($\partial\mathbf{u}/\partial t = 0$) incompressible
laminar flow read as follows:

$$(\mathbf{u}\cdot\nabla)\mathbf{u} = \nu\nabla^2\mathbf{u} - \frac{1}{\rho}\nabla p + \mathbf{b} \tag{2.4}$$

$$\nabla\cdot\mathbf{u} = 0. \tag{2.5}$$


2.1.2 Navier-Stokes Equations for Topology Optimization


As mentioned in Section 1.4.3, topology optimization in a CFD context can be implemented by
blocking the flow in specific cells of the discretization, corresponding to regions which should be
excluded from the design domain. A commonly used approach is to block the flow by introducing
an artificial resistance for the flow in the momentum Equation (2.4):

$$(\mathbf{u}\cdot\nabla)\mathbf{u} = \nu\nabla^2\mathbf{u} - \frac{1}{\rho}\nabla p - \alpha\mathbf{u} \tag{2.6}$$

$$\nabla\cdot\mathbf{u} = 0.$$

The new term $\alpha\mathbf{u}$ in (2.6) implements a momentum sink, which introduces a flow resistance and
as such allows regions of cells considered counterproductive for the flow to be penalized. Illustratively,
the resistance term $\alpha\mathbf{u}$ can be interpreted as a porosity. The pressure drop $\Delta p$ over a porous
medium of depth $\Delta x$ is commonly modeled by Darcy's law [Whi86] as $\Delta p/\Delta x = -(\mu/\kappa)\,u$,
with permeability $\kappa$. Therefore the momentum source is an implementation of Darcy's law
with $\mu/\kappa = \alpha$.

The question in which regions to penalize the flow, that is to find an optimal field $\alpha$, motivates
the need for (adjoint) sensitivities of a given objective with respect to the (discretized) parameters
$\alpha$.

In order to avoid momentum sources, which would accelerate the flow instead of penalizing it,
the values of $\alpha$ should be constrained such that $\alpha \ge 0$. To avoid too stiff discretization matrices
and to obtain a steady converged state for the field $\alpha$, the parameter should also be capped below
a certain maximum value $\alpha_{\max}$.

The constrained optimization problem to find a feasible and optimal field $\alpha$ can be stated as

$$\begin{aligned}
\text{minimize}\quad & J(\mathbf{u}(\alpha), p(\alpha), \alpha) \\
\text{subject to}\quad & \mathbf{r}(\mathbf{u}(\alpha), p(\alpha), \alpha) = 0 \\
& 0 \le \alpha \le \alpha_{\max},
\end{aligned}$$

where $J$ is a scalar or multivariate objective and $\mathbf{r}$ are the residuals of the momentum and mass
conservation equations.
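
The box constraint on $\alpha$ is easy to enforce inside an iterative optimizer. The Python sketch below shows one possible update, a projected steepest descent step; the text does not prescribe a particular optimizer at this point, so the update rule, the function names, and the sensitivity values are assumptions made for illustration only.

```python
def darcy_alpha(mu, kappa):
    """Resistance coefficient alpha = mu / kappa from Darcy's law."""
    return mu / kappa


def project(alpha, alpha_max):
    """Clip every cell value into the feasible box 0 <= alpha <= alpha_max."""
    return [min(max(a, 0.0), alpha_max) for a in alpha]


def projected_gradient_step(alpha, sensitivity, step, alpha_max):
    """One steepest-descent update alpha <- P(alpha - step * dJ/dalpha)."""
    updated = [a - step * s for a, s in zip(alpha, sensitivity)]
    return project(updated, alpha_max)


alpha = [0.0, 5.0, 10.0, 2.0]          # one penalization value per cell
sensitivity = [1.0, -2.0, -3.0, 0.5]   # made-up values of dJ/dalpha per cell
alpha_new = projected_gradient_step(alpha, sensitivity, step=2.0, alpha_max=10.0)
```

Cells whose unconstrained update would leave the interval $[0, \alpha_{\max}]$ are simply clipped back to the nearest bound, so negative values (momentum sources) can never occur.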


(a) no optimization (b) added porous medium (c) reconstruction

Figure 2.1: Typical topology optimization workflow, from baseline geometry (left) to final
optimized geometry (right).


A typical three step topology optimization workflow is depicted in Figure 2.1. Starting from 
an initial configuration of the domain (left), cells are penalized to improve the flow according 
to some objective (middle). To reduce discretization errors and to make the design parametric 
again, the geometry is transformed to a CAD representation, and remeshed. On this geometry 
the final flow state is calculated (right). 





2.1.3 Turbulence Models 


In contrast to laminar flows, turbulent flows are characterized by strong irregularities of all their
properties. Most importantly they exhibit strong changes of their velocity and pressure in space
and time [Dur08]. Those irregularities are chaotic in nature and occur on a variety of turbulent
length scales. Turbulent flows exhibit a high amount of dissipation due to the internal viscous
shear stresses.

Broadly speaking, turbulent behavior occurs once the flow has passed a critical Reynolds
number. The Reynolds number $Re$ is defined as

$$Re = \frac{\rho u L}{\mu} = \frac{u L}{\nu},$$

where $L$ is a characteristic length of the flow domain (e.g. diameter of a pipe, length of a car),
and $u$ is the scalar velocity magnitude of the fluid with respect to the obstacle. The somewhat
arbitrary choice of characteristic length $L$ makes the Reynolds number only a rough indicator for
the properties of the flow. A pipe flow with Reynolds number $Re_D > 4000$ is considered to
be fully turbulent, while a flow with $Re_D < 2000$ is considered laminar. Flows in the region in
between 2000 and 4000 are in a transition area, and are consequently called transition flows.
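
As a quick worked example, the Python sketch below evaluates the Reynolds number for a pipe flow and classifies it using the thresholds quoted above. The fluid properties are rough values for water at room temperature and, like the function names, are assumptions chosen for illustration.

```python
def reynolds_number(rho, u, length, mu):
    """Re = rho * u * L / mu for density rho, speed u, length L, dynamic viscosity mu."""
    return rho * u * length / mu


def pipe_flow_regime(re_d):
    """Classify a pipe flow by the thresholds 2000 and 4000 used in the text."""
    if re_d < 2000.0:
        return "laminar"
    if re_d > 4000.0:
        return "turbulent"
    return "transitional"


# Approximate water properties (assumed): rho in kg/m^3, mu in Pa s.
re_d = reynolds_number(rho=998.0, u=0.5, length=0.05, mu=1.0e-3)
regime = pipe_flow_regime(re_d)
```

With a pipe diameter of 5 cm and a mean velocity of 0.5 m/s the Reynolds number is about 25000, well inside the turbulent regime.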

The analysis of turbulent flows is a very relevant field, as most fluid flows occurring in nature 
or technical applications exhibit some degree of turbulence. 








The most intuitive, yet very expensive, way of calculating solutions to turbulent flows is to
utilize direct numerical simulation (DNS). With DNS the Navier-Stokes equations (2.5) are
solved without any further modeling of the turbulent nature of the problem. For DNS to give
reasonable results, all length scales of the physical problem, from the smallest turbulent length
scales to the macroscopic scale, have to be resolved directly, both in space and time, by the
chosen discretization. This resolution requirement makes DNS challenging and even on current
HPC hardware it is rarely used for complex geometries. One of its main uses is to derive and
understand turbulence models and to simulate flows which are very hard to model otherwise, e.g.
the laminar/turbulent transition region. With the transition of HPC to exascale computing, DNS
methods will likely become more prevalent, as the lower algorithmic complexity of DNS should
improve parallel scalability [Che+09].

The minimization of the computational requirements, while still reasonably capturing the
physical influence of turbulence, has been a very active field of research over the last decades (for
a historical overview see e.g. [Wil+98]). In the following sections a brief introduction of the most
common one- and two-equation turbulence models for steady flows will be given.





One Equation Model: Spalart-Allmaras 


A popular one-equation turbulence model is the Spalart-Allmaras model [SA92]. It is popular due
to its comparatively modest computational effort, involving the solution of only one additional
PDE. This model is particularly suitable for aerospace and turbomachinery applications, as it
was developed for aerodynamic flows.

Figure 2.2: Visualization of the Reynolds decomposition. Function $u$ is decomposed into a
smooth (filtered) part $\tilde{u}$ and chaotic part $u'$.

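The Spalart-Allmaras model describes the transport of the kinematic turbulent eddy viscosity
$\tilde{\nu}$ as a convection-diffusion equation:

$$\frac{\partial(\rho\tilde{\nu})}{\partial t} + \nabla\cdot(\rho\mathbf{u}\tilde{\nu}) = \frac{1}{\sigma}\left[\nabla\cdot\big(\rho(\nu+\tilde{\nu})\nabla\tilde{\nu}\big) + C_{b2}\,\rho\,|\nabla\tilde{\nu}|^2\right] + C_{b1}\,\rho\,\tilde{S}\,\tilde{\nu}\,(1 - f_{t2}) - \left(C_{w1} f_w - \frac{C_{b1}}{\kappa^2} f_{t2}\right)\rho\,\frac{\tilde{\nu}^2}{d^2} + S.$$

The equation involves several empirically determined constants $C$, $\sigma$, $\kappa$, as well as other closure
equations. For these definitions, we refer to the literature [SA92; AJ12]. The complexity of
this governing equation makes it apparent that the derivation of a continuous adjoint turbulence
model (see Section 2.5), while possible [Zym+09], is laborious and the implementation error prone.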





Two Equation Models: Reynolds-Averaged Navier-Stokes Equations 


The Reynolds-averaged Navier-Stokes equations (RANS) are time averaged equations of motion.
They are used to transform a flow field that is transient only at the turbulent length scale into a
steady flow, where the amount of turbulence is modeled by an additional set of variables.

The velocity of a flow is split into a mean and a fluctuating part using Reynolds decomposition,

$$\mathbf{u} = \tilde{\mathbf{u}} + \mathbf{u}',$$

where $\tilde{\mathbf{u}}$ is the mean (time averaged) velocity and $\mathbf{u}'$ the fluctuating turbulent part. (In the
literature the mean velocity is usually denoted with $\bar{\mathbf{u}}$; to avoid collision with the adjoint notation
we use the differing notation $\tilde{\mathbf{u}}$.) Figure 2.2 illustrates the superposition of a (synthetic) continuously
differentiable function $\tilde{u}$ with randomized values $u'$ from a standard distribution of lower length
scale. This example mimics the characteristics of the Reynolds decomposition.
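
A synthetic decomposition of this kind can be reproduced in a few lines. The Python sketch below is an illustration under stated assumptions: the sine-shaped smooth part, the Gaussian noise amplitude, the moving-average filter, and its window width are all arbitrary choices for the example and are not the data behind the figure.

```python
import math
import random

random.seed(0)  # fixed seed so that the synthetic signal is reproducible
n = 1000
t = [k / n for k in range(n)]
smooth = [1.0 + 0.5 * math.sin(2.0 * math.pi * tk) for tk in t]
signal = [s + random.gauss(0.0, 0.05) for s in smooth]  # smooth part plus noise


def moving_average(values, half_width):
    """Average over a window of up to 2 * half_width + 1 samples (truncated at the ends)."""
    out = []
    for k in range(len(values)):
        lo = max(0, k - half_width)
        hi = min(len(values), k + half_width + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


mean_part = moving_average(signal, 25)                 # filtered (mean) part
fluctuation = [u - m for u, m in zip(signal, mean_part)]  # chaotic part u'

mean_of_fluctuation = sum(fluctuation) / n
variance_of_fluctuation = sum(f * f for f in fluctuation) / n
```

By construction the two parts add up to the original signal, and the fluctuation averages out to approximately zero over the sampled interval.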

The effect of the turbulence can be modeled by an additional transport equation. The RANS
momentum equation for steady incompressible flow can be given as follows (with $i = \{0, 1, 2\}$,
Einstein summation over the index $j$, and $\mathbf{x} = [x_0, x_1, x_2] = [x, y, z]$):

$$\tilde{u}_j\frac{\partial\tilde{u}_i}{\partial x_j} = -\frac{1}{\rho}\frac{\partial p}{\partial x_i} + \frac{\partial}{\partial x_j}\left(\nu\frac{\partial\tilde{u}_i}{\partial x_j} - \overline{u_i' u_j'}\right).$$


For the solution of the RANS equations, additional closure terms are needed for the $\overline{u_i' u_j'}$
term, which is commonly referred to as Reynolds stress tensor. It is the closure of this term
in which the different RANS turbulence models differ from one another. Using the Boussinesq
approximation [Pop00], the Reynolds stress tensor can be related to a turbulent eddy viscosity $\nu_t$.
In the following, two common closure models are presented, which allow $\nu_t$ to be calculated, namely
the k-ε and the k-ω models.





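k-ε model: The k-ε equations model the transport of turbulent energy $k$ and the rate of turbulent
dissipation (into heat) $\varepsilon$ [LS83]:

$$\frac{\partial k}{\partial t} + \frac{\partial(k u_i)}{\partial x_i} = \frac{\partial}{\partial x_j}\left[\frac{\nu_t}{\sigma_k}\frac{\partial k}{\partial x_j}\right] + 2\nu_t E_{ij}E_{ij} - \varepsilon$$

$$\frac{\partial\varepsilon}{\partial t} + \frac{\partial(\varepsilon u_i)}{\partial x_i} = \frac{\partial}{\partial x_j}\left[\frac{\nu_t}{\sigma_\varepsilon}\frac{\partial\varepsilon}{\partial x_j}\right] + C_{1\varepsilon}\frac{\varepsilon}{k}\,2\nu_t E_{ij}E_{ij} - C_{2\varepsilon}\frac{\varepsilon^2}{k}$$

The eddy viscosity needed in the equations, as well as for the closure of the Reynolds stress tensor,
is calculated as $\nu_t = C_\mu k^2/\varepsilon$.

The k-ε model is suited for the calculation of free shear layer flows [Bar+97] and flows with
low pressure gradients. However, it does not perform well for flows that exhibit large adverse
pressure gradients [Wil+98].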


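k-ω model: The standard k-ω model [Wil+98], with default parameters, reads as:

$$\frac{\partial k}{\partial t} + u_j\frac{\partial k}{\partial x_j} = \tau_{ij}\frac{\partial u_i}{\partial x_j} - \frac{9}{100}\,k\,\omega + \frac{\partial}{\partial x_j}\left[\left(\nu + \frac{1}{2}\nu_t\right)\frac{\partial k}{\partial x_j}\right]$$

$$\frac{\partial\omega}{\partial t} + u_j\frac{\partial\omega}{\partial x_j} = \frac{5\,\omega}{9\,k}\,\tau_{ij}\frac{\partial u_i}{\partial x_j} - \frac{3}{40}\,\omega^2 + \frac{\partial}{\partial x_j}\left[\left(\nu + \frac{1}{2}\nu_t\right)\frac{\partial\omega}{\partial x_j}\right]$$

The eddy viscosity needed in the equations, as well as for the closure of the Reynolds stress tensor,
is calculated as $\nu_t = k/\omega$. The k-ω model is best suited for flows with wall effects. Extended
versions like k-ω SST (shear stress transport) [Men93] exist, which switch between k-ω behavior
near walls and k-ε behavior in the free stream.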
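
The two eddy viscosity relations can be compared directly. In the Python sketch below, the value $C_\mu = 0.09$ and the conversion $\omega = \varepsilon/(C_\mu k)$ are commonly quoted conventions that are assumed here rather than taken from the text; the function names are hypothetical.

```python
C_MU = 0.09  # commonly quoted k-epsilon model constant (assumed, not given in the text)


def nu_t_k_epsilon(k, epsilon, c_mu=C_MU):
    """Eddy viscosity of the k-epsilon model: nu_t = C_mu * k^2 / epsilon."""
    return c_mu * k * k / epsilon


def nu_t_k_omega(k, omega):
    """Eddy viscosity of the k-omega model: nu_t = k / omega."""
    return k / omega


def omega_from_k_epsilon(k, epsilon, c_mu=C_MU):
    """Assumed conversion between the two dissipation variables: omega = epsilon / (C_mu * k)."""
    return epsilon / (c_mu * k)


k, epsilon = 0.5, 2.0
nu_t_eps = nu_t_k_epsilon(k, epsilon)
nu_t_om = nu_t_k_omega(k, omega_from_k_epsilon(k, epsilon))
```

Under the assumed conversion both closures yield the same eddy viscosity, which is the property exploited by blended models that switch between the two formulations.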





Other Models 


If one desires to capture the transient effects of the turbulence or calculates a case for which the
mentioned RANS methods are not well suited, other more computationally expensive methods
are available. One particularly useful method is the large eddy simulation (LES) model. The
LES model is a filtering approach, which removes the information of the flow field on the smallest
turbulent length scales and models the effect of that information on the flow field by
various approaches. A flow field filtered in such a way can be calculated on a coarser mesh,
compared to a DNS case. LES simulations are inherently transient and thus consume a lot of
computational resources. The requirements are amplified when coupled with the calculation of
the adjoint, as potentially very long iteration histories need to be reversed. The detached eddy
simulation (DES) is a combination of the RANS and LES approaches. It switches between a
RANS model in regions of the flow with very small turbulent length scales (especially near walls),
which cannot be covered by LES without a prohibitively fine mesh, and an LES model in regions
where the spatial resolution of the mesh allows the turbulence to be resolved by LES.


2.1.4 Computational Mesh


The calculation of solutions to the governing equations using FVM requires the discretization
of the domain into finitely small control domains. They need to cover the full computational
domain without overlapping, follow the shape of the boundary, and fulfill certain quality criteria.
This domain discretization is commonly called mesh or grid (for structured meshes). For general
CFD applications, meshes can be categorized into one of two classes: structured or unstructured.
Structured meshes allow the efficient determination of the cell neighborhood of a given cell without
costly memory lookups. For example, in a 2D mesh, where the cells are numbered in a matrix-like
(row, column) fashion, the neighborhood of a cell $(i, j)$ away from any boundaries can be
obtained as $(i-1, j)$, $(i+1, j)$, $(i, j-1)$, $(i, j+1)$. In general this leads to good cache locality of
the code implementing the discretization, and sparse (banded) discretization matrices with known
sparsity pattern. Structured meshes do not necessarily have to be equidistant or orthogonal
(see e.g. Figure 2.3). However, the number of neighbors and connectivity for each cell is fixed,
severely limiting the possibility to refine the mesh and the adaptability to complex geometries.
This downside can be mitigated by allowing hanging nodes, resulting in a mesh which retains the
good numerical characteristics of a structured mesh, while still allowing refinement near important
geometry features.

Unstructured meshes allow the decomposition of the computational domain into arbitrary
polyhedral subvolumes. Thus, the number of neighbors of a cell is not fixed, and the neighborhood
cannot be trivially constructed. The mesh connectivity must be stored in a suitable data structure
and looked up when the neighborhood needs to be constructed. This leads to more memory
accesses and a non cache-local memory access pattern.
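
The difference between the two mesh classes shows up directly in how a neighborhood is obtained. The Python sketch below contrasts index arithmetic on a structured mesh with a lookup in a stored connectivity table; the face list format, a pair of cell indices per internal face, and all names are assumptions chosen for illustration.

```python
def structured_neighbors(i, j, n_rows, n_cols):
    """Neighbors of cell (i, j) on a structured 2D mesh, computed by index arithmetic."""
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    # Drop candidates that would fall outside the mesh (cells at the boundary).
    return [(r, c) for r, c in candidates if 0 <= r < n_rows and 0 <= c < n_cols]


def build_adjacency(n_cells, faces):
    """Connectivity table of an unstructured mesh from (cell, cell) pairs of internal faces."""
    adjacency = [[] for _ in range(n_cells)]
    for cell_a, cell_b in faces:
        adjacency[cell_a].append(cell_b)
        adjacency[cell_b].append(cell_a)
    return adjacency


interior = structured_neighbors(1, 1, 3, 3)
corner = structured_neighbors(0, 0, 3, 3)

faces = [(0, 1), (0, 2), (1, 2), (2, 3)]  # hypothetical four-cell mesh
adjacency = build_adjacency(4, faces)
```

In the structured case no memory beyond the cell indices is touched, whereas the unstructured case requires the adjacency table to be built once and read on every neighborhood query.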

The convergence speed and solution quality of many numerical solution algorithms depend
on the mesh quality. Mesh quality can be characterized by different metrics, the importance of
which varies between different solution algorithms.














Cell aspect ratio: Ratio between the biggest and smallest area of a cell's bounding box (best if
ratio equals one).

Cell non-orthogonality: Angle between the vector connecting two cells and the face normal of
the face connecting both cells (best if equals zero). Compare to vector $\mathbf{s}_f$ in Figure 2.4(a).

Cell non-conjunctionality: Distance from the intersection of the connection of two cell centers
with the face connecting both cells ($f'$) to the face center ($f_c$) (best if equals zero). Compare
to Figure 2.4(b).

Figure 2.3: Three structured meshes. Structured equidistant mesh (left), structured mesh
refined around region of interest (middle), structured non-Cartesian mesh (right).


(a) Mesh non-orthogonality (b) Mesh non-conjunctionality 


Figure 2.4: Visualization of mesh non-orthogonality (left) and non-conjunctionality (right). 
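
Two of the metrics above translate directly into small geometric computations. The Python sketch below follows the definitions given in the text; the function names and the sample cells are assumptions made for illustration.

```python
import math


def non_orthogonality_deg(center_c, center_n, face_normal):
    """Angle in degrees between the cell-center connection and the face normal."""
    d = [b - a for a, b in zip(center_c, center_n)]
    dot = sum(x * y for x, y in zip(d, face_normal))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_s = math.sqrt(sum(x * x for x in face_normal))
    # Clamp to [-1, 1] to guard acos against rounding errors.
    cos_angle = max(-1.0, min(1.0, dot / (norm_d * norm_s)))
    return math.degrees(math.acos(cos_angle))


def aspect_ratio(dx, dy, dz):
    """Ratio between the largest and smallest face area of a bounding box."""
    areas = [dx * dy, dy * dz, dx * dz]
    return max(areas) / min(areas)


orthogonal = non_orthogonality_deg((0, 0, 0), (1, 0, 0), (1, 0, 0))
skewed = non_orthogonality_deg((0, 0, 0), (1, 1, 0), (1, 0, 0))
cube = aspect_ratio(1.0, 1.0, 1.0)
stretched = aspect_ratio(2.0, 1.0, 1.0)
```

The orthogonal pair of cells yields an angle of zero, the skewed pair an angle of 45 degrees; the cube has the ideal aspect ratio of one.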


2.2 Finite Volume Methods 


The finite volume method (FVM) was developed in the 1970s and 1980s as a way to discretize and
solve partial differential equations (PDEs). The FVM is popular in CFD, as it allows discretization
directly on the physical domain without transforming to a computational domain first [MMD+16]
(like the test function space required by FEM). It is able to discretize complex domains, granted
that enough volumes are available to capture the features of the domain. It is related to the finite
difference (FD) and finite element methods (FEM). For a comparison between FVM, FEM, and
FD see e.g. [EGH00].

The FVM is based on the concept of balance equations on discrete control volumes. In a first 
step the PDEs, given in differential form, are integrated and transformed into balance equations 
over the discrete control volumes. The resulting volume and surface integrals are transformed into 
discrete expressions by approximating them with numeric quadratures. In a second approximation 
step the quantities from the cell centroids are interpolated onto the faces and are numerically 
integrated over the face area. The surface integrals describe the flux of conservation quantity in 
and out of the volumes through their respective faces. 

The formulation of fluxes through the cell faces makes the method locally conservative (meaning 
that the flux leaving a volume is equal to the flux entering its adjacent volume). This makes the 
method well suited for problems where fluxes are of importance, such as fluid mechanics. It can 
be applied to both structured and unstructured meshes in arbitrary 2D and 3D domains. 


2.2.1 FVM on Scalar Transport Equation 


We will motivate the FVM with the discretization of the general convection-diffusion equation. It
models the transport of a scalar physical quantity (e.g. temperature) within a fluid field, which
travels with given velocity $\mathbf{u}$. The quantity is convected downstream by the velocity field. In
addition to convection, the quantity spreads in all directions, due to diffusivity.

The transport of scalar quantity $\psi$ in a fluid flow can be modeled by the following partial
differential equation [MMD+16]:

$$\underbrace{\frac{\partial(\rho\psi)}{\partial t}}_{\text{transient term}} + \underbrace{\nabla\cdot(\rho\mathbf{u}\psi)}_{\text{convective term}} = \underbrace{\nabla\cdot(\mu\nabla\psi)}_{\text{diffusive term}} + \underbrace{Q^{\psi}}_{\text{source term}},$$

where $\rho$ and $\mathbf{u}$ are the density and velocity of the fluid field transporting $\psi$, and $\mu$ is the diffusivity
constant for $\psi$. The equation is closely related to the Navier-Stokes momentum equation. The
momentum equation can be stated as the transport equation with $\psi = \mathbf{u}$. The non-linearity introduced
in the convection complicates the formulation of the FVM, thus for now we will treat $\psi$ and $\mathbf{u}$ as
separate entities. Starting from the PDE formulation, the construction of the FVM follows along
the lines of the derivation of the governing equation, but backwards.

Integrating the differential steady state form ($\partial(\rho\psi)/\partial t = 0$) over a (finite) volume $V_C \subset \Omega \subset \mathbb{R}^3$
gives the following integral form:

$$\int_{V_C} \nabla\cdot(\rho\mathbf{u}\psi)\,\mathrm{d}V = \int_{V_C} \nabla\cdot(\mu\nabla\psi)\,\mathrm{d}V + \int_{V_C} Q^{\psi}\,\mathrm{d}V. \tag{2.7}$$

The volume integrals for convection and diffusion are transformed to surface integrals by applying
the divergence theorem:

$$\oint_{\partial V_C} (\rho\mathbf{u}\psi)\cdot\mathbf{n}\,\mathrm{d}S = \oint_{\partial V_C} (\mu\nabla\psi)\cdot\mathbf{n}\,\mathrm{d}S + \int_{V_C} Q^{\psi}\,\mathrm{d}V.$$

This states that, in absence of transient effects, the amount of quantity entering and exiting each
finite volume $V_C$, driven by the convective and diffusive fluxes, has to match the amount
created/removed by the source term. From this integral formulation over cells and faces, we aim
to introduce approximations, which transform the integrals on each volume $V_C$ into sums. The
summands of the individual elements can later be assembled into a global linear system, which
can be solved for the field of the unknown quantity $\psi$.

The surface integrals of the fluxes are approximated by replacing them with numerical
integrations on the faces. In the discretized domain the values of the quantity are only explicitly
defined at the cell centers. The values thus need to be interpolated from the cell centers to the
cell faces. The most common option is to choose one integration point at the face center and
to multiply its value with the face area, yielding a second order accurate approximation (the integral of
linear functions is evaluated exactly by this integration). We define the total flux $\mathbf{J}^{\psi}$ as the sum of
the convective and diffusive flux as

$$\mathbf{J}^{\psi} = \rho\mathbf{u}\psi - \mu\nabla\psi.$$

The integration over the flux through face $f$ can then be approximated by

$$\int_f \mathbf{J}^{\psi}\cdot\mathrm{d}\mathbf{S} \approx \mathbf{J}^{\psi}_f\cdot\mathbf{s}_f,$$

where $\mathbf{J}^{\psi}_f$ is the flux calculated at the midpoint of face $f$ and $\mathbf{s}_f$ is the face area vector.

The per-face calculation of the fluxes from values at cell centers of the adjacent cells makes the
cell centered finite volume scheme inherently conservative. The flux leaving a control domain
over one face exactly equals the flux entering the adjacent control domain (with opposite sign).

For orthogonal meshes, the vector connecting two cell centers passes through the center
of the face shared by both cells. In this case the value of $\psi$ at the face center can be linearly
interpolated from the neighboring cells as

$$\psi_f \approx \left(1 - \frac{\|\mathbf{d}_1\|}{\|\mathbf{d}\|}\right)\psi_C + \frac{\|\mathbf{d}_1\|}{\|\mathbf{d}\|}\,\psi_N,$$

where $\mathbf{d}_1$ is the vector connecting the midpoint of cell $C$ to the face center $q_f$ and $\mathbf{d}$ the vector
connecting $q_C$ and $q_N$ (see Figure 2.5).

Figure 2.5: Two orthogonal cells, sharing one face. Cell centers are connected by $\mathbf{d} = \mathbf{d}_1 + \mathbf{d}_2$.

The gradient in direction of the normal $\mathbf{n} = \mathbf{d}_2/\|\mathbf{d}_2\|$ can be approximated by the difference

$$\nabla\psi_f\cdot\mathbf{n} \approx \frac{\psi_N - \psi_C}{\|\mathbf{d}\|}.$$

Thus, the flux integral of the face can be evaluated by a linear expression, depending only on the
quantities of $\psi$ at the cell centers of cell $C$ and its neighbor cell $N$:

$$\mathbf{J}^{\psi}_f\cdot\mathbf{s}_f = a_C\,\psi_C + a_N\,\psi_N.$$
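
For a single face of an orthogonal mesh, the interpolation weights and the coefficients of the diffusive part of the flux follow directly from the geometry. The Python sketch below spells this out; the function names and the sample geometry are assumptions chosen for illustration, and only the diffusive contribution to $a_C$ and $a_N$ is shown.

```python
import math


def norm(v):
    """Euclidean length of a vector given as a sequence of components."""
    return math.sqrt(sum(x * x for x in v))


def interpolate_to_face(psi_c, psi_n, q_c, q_f, q_n):
    """Linear interpolation of psi to the face center on an orthogonal mesh."""
    d1 = [f - c for c, f in zip(q_c, q_f)]  # cell center C -> face center
    d = [n - c for c, n in zip(q_c, q_n)]   # cell center C -> cell center N
    w = norm(d1) / norm(d)
    return (1.0 - w) * psi_c + w * psi_n


def normal_gradient(psi_c, psi_n, q_c, q_n):
    """Difference approximation of the face normal gradient of psi."""
    d = [n - c for c, n in zip(q_c, q_n)]
    return (psi_n - psi_c) / norm(d)


def diffusive_flux_coefficients(mu, q_c, q_n, face_area):
    """Coefficients (a_C, a_N) of the diffusive flux -mu * grad(psi) . s_f."""
    d = [n - c for c, n in zip(q_c, q_n)]
    g = mu * face_area / norm(d)
    return g, -g


q_c, q_f, q_n = (0.0, 0.0, 0.0), (0.25, 0.0, 0.0), (1.0, 0.0, 0.0)
psi_f = interpolate_to_face(2.0, 6.0, q_c, q_f, q_n)
grad_n = normal_gradient(2.0, 6.0, q_c, q_n)
a_c, a_n = diffusive_flux_coefficients(0.1, q_c, q_n, face_area=2.0)
flux = a_c * 2.0 + a_n * 6.0
```

The flux assembled from the two coefficients equals $-\mu\,(\nabla\psi_f\cdot\mathbf{n})\,\|\mathbf{s}_f\|$, i.e. the quantity diffuses from the cell with the larger value toward the cell with the smaller value.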





For non-orthogonal meshes, the vector connecting two cell centers does not necessarily pass
through the center of the face shared by both cells (see Figure 2.6). If the intersection point
is used instead of the actual face center, the numerical interpolation to the face loses its
second order accuracy. Furthermore, the face normal direction, along which the gradient has to
be evaluated, no longer aligns with the vector connecting the cell centers. The flux can
be corrected to better match the flux which would be obtained by using the correct values at
the midpoint. However, as this correction contains non-linear terms, it cannot be added to the
system matrix but instead has to be treated as a source term for the linear equation system:

$$\mathbf{J}^{\psi}_f\cdot\mathbf{s}_f = a_C\,\psi_C + a_N\,\psi_N + f(\psi_C, \psi_N).$$


Discretizing the remaining volume integral of the source term from (2.7) by a numerical integration
over the cell

$$\int_{V_C} Q^{\psi}\,\mathrm{d}V \approx Q^{\psi}_C\,V_C,$$

we can express the balance equation at each cell by the expression

$$a_C\,\psi_C + \sum_{F \sim NB(C)} a_F\,\psi_F = b_C,$$

where $F \sim NB(C)$ denotes all cells in the neighborhood of $C$. The source term, as well as the
non-linear parts of the non-orthogonal corrections, are accumulated into the right-hand side $b_C$.
From those balance equations one can lump together all cells into a linear equation system:

$$A_\psi\,\boldsymbol{\psi} = \mathbf{b}_\psi,$$

where the matrix coefficients $a_C$ constitute the matrix diagonal $A_{CC}$ and the coefficients $a_F$
populate the off-diagonal elements $A_{FC}$ and $A_{CF}$. The system can be solved to obtain the desired
quantities $\psi$. The linear equation system needs to be assembled and solved multiple times if
non-linear effects are included in either the right-hand side (due to the non-orthogonal correction)
or the assembly of the matrix (i.e. the matrix entries depend on $\psi$, which is commonly the case
when discretizing the momentum equation, where the velocity is both the desired quantity and
driving the convection).

Figure 2.6: Cell centered finite volume. Flux is calculated through faces. Note the distance
between face midpoint and the intersection between face and cell center vector due to
non-orthogonality.
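
The complete chain from face fluxes to a linear system can be followed on a one-dimensional example. The Python sketch below discretizes steady convection-diffusion without source term on a uniform mesh with fixed values at both ends, using linear interpolation at the faces, and solves the resulting tridiagonal system. The setup, the parameter values, the function names, and the choice of the Thomas algorithm as solver are assumptions made for illustration and do not describe the OpenFOAM implementation.

```python
import math


def assemble_1d(n, length, rho, u, mu, psi_west, psi_east):
    """Assemble a_C psi_C + sum a_F psi_F = b_C for n uniform cells on [0, length]."""
    dx = length / n
    conv = rho * u   # convective mass flux per unit face area
    diff = mu / dx   # diffusive conductance between neighbouring cell centers
    lower, diag, upper, rhs = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
    for c in range(n):
        if c > 0:                      # internal west face
            lower[c] = -(diff + conv / 2)
            diag[c] += diff + conv / 2
        else:                          # west boundary face, distance dx / 2
            diag[c] += 2 * diff + conv
            rhs[c] += (2 * diff + conv) * psi_west
        if c < n - 1:                  # internal east face
            upper[c] = -(diff - conv / 2)
            diag[c] += diff - conv / 2
        else:                          # east boundary face, distance dx / 2
            diag[c] += 2 * diff - conv
            rhs[c] += (2 * diff - conv) * psi_east
    return lower, diag, upper, rhs


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal linear system."""
    n = len(diag)
    c_prime, d_prime = [0.0] * n, [0.0] * n
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


def exact(x, length, rho, u, mu, psi_west, psi_east):
    """Analytical solution of the 1D steady convection-diffusion problem."""
    ratio = (math.exp(rho * u * x / mu) - 1.0) / (math.exp(rho * u * length / mu) - 1.0)
    return psi_west + (psi_east - psi_west) * ratio


def max_error(n, length=1.0, rho=1.0, u=0.1, mu=0.1, psi_west=1.0, psi_east=0.0):
    """Largest deviation of the FVM solution from the analytical one at the cell centers."""
    psi = solve_tridiagonal(*assemble_1d(n, length, rho, u, mu, psi_west, psi_east))
    dx = length / n
    return max(abs(p - exact((c + 0.5) * dx, length, rho, u, mu, psi_west, psi_east))
               for c, p in enumerate(psi))


error_coarse = max_error(5)
error_fine = max_error(50)
```

Each cell only couples to its direct neighbors, so the matrix is tridiagonal here; refining the mesh reduces the error with respect to the analytical solution.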

The fluxes of each cell only directly depend on the values at the cell itself and the directly
neighboring cells. Thus, the matrix $A_\psi$ will in general be sparse. Figure 2.7 shows the sparsity
pattern resulting from the discretization of the momentum equations for the motorbike tutorial
case of OpenFOAM.

This case generates an unstructured mesh with $n_C = 352570$ cells and $n_F = 1054817$ internal faces,
leading to a $352570 \times 352570$ sparse matrix with $n_{nz} = n_C + 2\cdot n_F = 2460120$ non-zero elements.
Straight from the mesher the bandwidth of the matrix is 349965 (left in the figure). A renumbering
with the Cuthill-McKee algorithm [CM69] reduces the bandwidth to 20875 (right in the figure).
A reduced matrix bandwidth can lead to improved linear solver performance [Maf14]. For example,
the incomplete LU factorization [CV97], utilized as a preconditioner in many iterative solvers,
benefits from the lower fill-in generated by a mesh reordered with Cuthill-McKee.
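
The effect of the renumbering can be reproduced on a very small mesh. The following Python sketch is a basic textbook variant of the Cuthill-McKee ordering (breadth-first traversal, visiting neighbors in order of increasing degree); it is not the OpenFOAM implementation, and the six-cell connectivity is a made-up example.

```python
from collections import deque


def bandwidth(adjacency, order):
    """Largest index distance between connected cells under a given cell ordering."""
    position = {cell: idx for idx, cell in enumerate(order)}
    return max((abs(position[c] - position[nb])
                for c in range(len(adjacency)) for nb in adjacency[c]), default=0)


def cuthill_mckee(adjacency):
    """Breadth-first ordering that starts at low-degree cells and visits low-degree neighbors first."""
    n = len(adjacency)
    visited = [False] * n
    order = []
    for start in sorted(range(n), key=lambda c: len(adjacency[c])):
        if visited[start]:
            continue  # already reached from an earlier start cell
        visited[start] = True
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            order.append(cell)
            for nb in sorted(adjacency[cell], key=lambda c: len(adjacency[c])):
                if not visited[nb]:
                    visited[nb] = True
                    queue.append(nb)
    return order


# Six cells connected in a chain 0-3-5-1-4-2, i.e. with a scrambled numbering.
adjacency = [[3], [5, 4], [4], [0, 5], [1, 2], [3, 1]]
original_bandwidth = bandwidth(adjacency, list(range(6)))
order = cuthill_mckee(adjacency)
reordered_bandwidth = bandwidth(adjacency, order)
```

For the chain of cells the reordering recovers the natural numbering along the chain, so all non-zero entries end up directly next to the matrix diagonal.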











2.2.2 Common FVM Boundary Conditions 


In the FVM, fluxes over the domain boundaries are determined by boundary conditions specified on the
boundary faces. A variety of boundary conditions exist to model diverse physical behavior. Here
only the two most common choices are presented, namely the Dirichlet and Neumann boundary
conditions. The former directly specifies the value of a transport quantity at the boundary face,
the latter prescribes a flux.

Figure 2.7: Sparse FVM discretization matrix of the momentum equations for a large unstructured
mesh (motorbike tutorial case of OpenFOAM). Left in original node order from the
snappyHexMesh mesher, right ordered with the Cuthill-McKee algorithm.


Dirichlet Boundary Condition 


For a scalar quantity $\psi$ which is to be convected, e.g. at an inlet, the quantity on the boundary
is set explicitly to a (possibly time and space dependent) specified value:

\[ \psi_b = \psi_{\mathrm{specified}} . \]

A representation of a cell stencil in a structured Cartesian mesh, with a Dirichlet boundary
condition applied to the west face, is shown in Figure 2.8. The flux through the boundary face
can be explicitly calculated with the specified value as

\[ J_b = \dot{m} \, \psi_b . \]











The flux only depends on the specified value and not on the cell central value $\psi_C$. This corresponds
to an upwind interpolation of $\psi_b$ from a virtual cell center $q_W$.








Figure 2.8: Finite volume stencil around a cell $C$ with a boundary face to the west. The value $\psi_b$ is
specified on the boundary face and enters the FVM discretization of the cell.


Examples for Dirichlet boundary conditions include:

• fixed velocity profile at an inlet, e.g. $u = u_{\mathrm{in}} \cdot n_b$,

• velocity $u = (0, 0, 0)$ at a no-slip wall boundary,

• and fixed pressure at an outlet.


In OpenFOAM the Dirichlet boundary condition is specified by the fixedValue keyword, as
shown in Listing 2.1 for a constant inlet velocity of 1 m/s.


dimensions      [0 1 -1 0 0 0 0];
boundaryField{
    inlet{
        type    fixedValue;
        value   uniform (1 0 0);
    }
}


Listing 2.1: Fixed velocity of 1 m/s at patch inlet, specified in case configuration file 0/U.


Neumann Boundary Condition

The Neumann boundary condition fixes the flux through a face to a specified value:

\[ J_b = q \, \|s_b\| , \]

where $q$ is the flux per unit area and $\|s_b\|$ is the face area of the boundary face. The prescription
of the face flux is equivalent to fixing the gradient in face normal direction $\nabla \psi_b \cdot n_b$ to a fixed
value. As the mass flux $\dot{m}$ is considered to be constant, the quantity $\psi_b$ must change accordingly
to realize the desired flux.
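
The two boundary treatments can be contrasted in a few lines. This is a minimal sketch under stated assumptions, not OpenFOAM code: the function names are hypothetical, and the one-sided difference used to recover a boundary value from a prescribed normal gradient is an assumption that the text does not spell out.

```python
def dirichlet_face_flux(mass_flux, psi_boundary):
    """Convective flux through a boundary face whose value is prescribed."""
    return mass_flux * psi_boundary


def neumann_face_flux(q, face_area):
    """Flux through a boundary face with a prescribed flux q per unit area."""
    return q * face_area


def neumann_boundary_value(psi_cell, normal_gradient, distance):
    """Boundary value implied by a fixed normal gradient.

    Assumption: first-order one-sided difference between the cell center and
    the boundary face center, which are `distance` apart.
    """
    return psi_cell + normal_gradient * distance


inlet_flux = dirichlet_face_flux(mass_flux=2.0, psi_boundary=1.5)
wall_flux = neumann_face_flux(q=0.5, face_area=4.0)
# A zero gradient simply copies the cell value to the boundary face.
zero_gradient_value = neumann_boundary_value(psi_cell=1.0, normal_gradient=0.0, distance=0.25)
```

The Dirichlet flux is independent of the cell value, whereas the Neumann boundary value follows the cell value.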

Common applications of the Neumann boundary condition are: 





• zero gradient pressure condition at an inlet, $\nabla p \cdot n_b = 0$,

• zero gradient velocity condition at an outlet, $\nabla u \cdot n_b = 0$,

• and symmetry plane.


In OpenFOAM the Neumann boundary condition is specified by the fixedGradient keyword.
By far the most common application of the Neumann boundary condition is to prescribe a zero
gradient condition, which can also be set by using the zeroGradient keyword. This is illustrated
for a zero gradient pressure condition on the inlet and domain walls in Listing 2.2.


dimensions      [0 2 -2 0 0 0 0];
boundaryField{
    inlet{ type zeroGradient; }
    walls{
        type        fixedGradient;
        gradient    uniform 0;
    }
}


Listing 2.2: Zero gradient pressure condition at patch inlet, specified in configuration file 0/p.



2.3 Semi-Implicit Solution Algorithms 


The SIMPLE (Semi-Implicit Method for Pressure Linked Equations) algorithm [PS72] is one
example of a class of solvers for systems of nonlinear partial differential equations (PDEs). The
algorithm solves the steady incompressible Navier-Stokes equations by linearizing the equations
and discretizing them according to the FVM. The linearization of the convection term makes
an outer iteration loop necessary. The embedding of linear equation system solvers into a more
general outer iteration loop is a common occurrence in CFD and more general simulation codes
which involve nonlinear (differential) equations [Nau+15].

The SIMPLE algorithm decouples the momentum equations from the mass conservation
equations. This allows the linear equation systems for velocity and pressure to be built and solved
independently, making the resulting linear systems considerably smaller and easier to solve. The
Navier-Stokes equations can also be solved fully coupled, which makes the inner iterations more
expensive but considerably speeds up the convergence of the outer iteration.

For the solution of the momentum equation, the pressure is assumed to be known and accordingly
only enters the equations on the right-hand side. The momentum equations are solved component-wise,
giving a velocity field $U^* = (u_0, \ldots, u_{n_C-1})$ which fulfills them. For the velocity components
outside of the implicit direction, the newest solutions from previous SIMPLE iterations are used;
the coupling between the spatial directions is only introduced during the correction steps:

\[
\begin{aligned}
A_{u_x} U_x^{*,i+1} &= b_{u_x}(U^i) \\
A_{u_y} U_y^{*,i+1} &= b_{u_y}(U^i) \\
A_{u_z} U_z^{*,i+1} &= b_{u_z}(U^i) ,
\end{aligned}
\]

where the right-hand side vector $b_u$ is a function of the velocities in the previous iteration
step. At the beginning of the algorithm, the pressure, driving the velocity field, is only a guess.
The velocities will in general not fulfill the (discretized) mass conservation equation. Therefore
additional corrections are required.

When solving the momentum and mass conservation equations separately, there is no equation
that can be solved for the pressure field $p$ directly, as the momentum equation is already used
to uniquely determine the velocities, while the mass conservation equation only directly depends
on the mass fluxes and hence the velocities. To close the equations, two correction terms for the
pressure and velocities are defined. If the corrections can be related to one another, they can be
used to determine a new pressure field.

A velocity correction is sought which corrects the velocity field such that it fulfills mass
conservation at every cell:

\[ U = U^* + U' , \]

where $U^*$ are the velocities fulfilling the momentum equations and $U'$ are the corrections needed
to obtain velocities $U$ consistent with the mass conservation equation.

We assume that the velocity corrections can be uniquely determined from a separate pressure
correction

\[ p = p^* + p' , \]

where $p^*$ is the current guess for the pressure, and $p'$ is the correction required such that the
pressures $p$ drive the velocities to fulfill the mass conservation equation.




Tying the velocity corrections to the pressure corrections allows a linear equation system
for the pressure corrections to be assembled,

\[ A_p \, p' = b_p , \]

which can be solved implicitly. The right-hand side vector $b_p$ bundles the explicit contribution of
the pressure correction equation. The obtained pressure corrections $p'$ can then be used to correct
the pressures and also to explicitly correct the velocities to better solve the mass conservation
equation (the latter correction is optional, but improves convergence). The alternating solution of
the momentum and pressure correction equations is repeated until the flow field has converged.

In the discretization of the governing equations OpenFOAM introduces the scalar quantity $\phi$,
which describes the mass flux of fluid through a cell face (the convective part of the total flux $J$
introduced in Section 2.2.1). The mass flux is required in the discretization of the pressure
correction equation. It is defined as:

\[ \phi = \rho A (u \cdot n) = \rho (u \cdot s) . \]

For incompressible problems, the normalization of the Navier-Stokes equations with $\rho$ leads
to a normalized mass face flux $\phi = u \cdot s$. As a face flux, it is defined on the cell faces, as
opposed to the cell centered quantities like velocity and pressure, giving the discrete mass flux
field $\phi \in \mathbb{R}^{n_F}$. Intuitively the mass fluxes could be interpolated from the cell centered velocities
onto the faces; however, this introduces numerical issues, commonly referred to as checkerboarding.
Instead the following iterative procedure is chosen to update $\phi$ in each iteration of the SIMPLE
algorithm [RC83]:

\[ \phi = \left( \frac{H(u)}{a_p} - \frac{1}{a_p} \nabla p \right) \cdot s , \]

where $H(u)$ is the product of the off-diagonal coefficients of a specific cell in the discretization of
the momentum equations (see also Section 3.7.3) and $a_p$ is the diagonal coefficient.
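
The definition of the face flux as the dot product of velocity and face area vector can be illustrated as follows. This is a minimal sketch with hypothetical names: it assumes the face velocity is already known (it does not implement the Rhie-Chow style update) and uses a single unit cube cell with outward pointing face area vectors.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def face_flux(u_face, s_face, rho=None):
    """Face flux u . s (incompressible) or rho * (u . s) (compressible)."""
    phi = dot(u_face, s_face)
    return phi if rho is None else rho * phi


# Outward face area vectors of a unit cube cell (east, west, north, south, top, bottom).
unit_cube_faces = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

# A uniform velocity field is divergence free, so the fluxes must cancel.
u = (2.0, 1.0, 0.0)
fluxes = [face_flux(u, s) for s in unit_cube_faces]
net_flux = sum(fluxes)
```

The vanishing sum of the face fluxes is the discrete mass conservation statement for this cell.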
Summarizing, the SIMPLE algorithm consists of the following steps:

1. Discretize the momentum equations and assemble the linear equation system;

2. Solve the discretized momentum equations to obtain the intermediate velocity field $U^*$;

3. Compute the uncorrected mass fluxes $\phi$ at the cell faces;

4. Discretize the pressure correction equation and assemble the linear equation system;

5. Solve the discretized pressure correction equation to obtain the pressure correction field $p'$;

6. Calculate the new pressure field $p$ from $p'$; if desired, apply under-relaxation:
   $p^{i+1} = p^i + \alpha (p - p^i)$, with $0 < \alpha < 1$;

7. Correct the mass fluxes $\phi$ at the cell faces with the calculated pressure corrections;

8. Correct the velocities $U^*$ to fulfill mass conservation, yielding $U$;

9. If desired, apply under-relaxation: $U^{i+1} = U^i + \alpha (U - U^i)$, with $0 < \alpha < 1$;

10. Repeat steps 1 to 9 until convergence of the pressure and velocity fields is obtained or the maximum
    number of iterations has been exceeded.
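
The interplay of under-relaxation (steps 6 and 9) and the outer iteration (step 10) can be sketched on a scalar toy problem. This is not a flow solver: the nonlinear fixed-point problem x = cos(x) merely stands in for one pass through steps 1 to 8, and the names `under_relax` and `outer_iteration` are hypothetical.

```python
import math


def under_relax(old, new, alpha):
    """Blend the previous iterate with the newly computed one."""
    return old + alpha * (new - old)


def outer_iteration(solve_step, x0, alpha, tol=1e-12, max_iter=100):
    """Repeat a nonlinear update with under-relaxation until the change is below tol."""
    x = x0
    for iteration in range(1, max_iter + 1):
        x_new = under_relax(x, solve_step(x), alpha)
        if abs(x_new - x) < tol:
            return x_new, iteration
        x = x_new
    return x, max_iter


# Toy stand-in for one SIMPLE pass: the fixed point of x = cos(x).
x, iterations = outer_iteration(math.cos, x0=1.0, alpha=0.7)
```

A relaxation factor of one reproduces the unrelaxed update, while smaller values damp the change between outer iterations.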

The SIMPLE algorithm is popular due to its rather straightforward implementation. For better
convergence with less under-relaxation, the improved SIMPLEC algorithm can be implemented.
Due to the decoupling of the PDEs, both algorithms produce linear equation systems which
are significantly smaller and less stiff than a fully coupled system. Due to the steady growth
of computing capabilities, a fully coupled approach becomes increasingly feasible for practical
applications. Such solvers are available in the FOAM-extend project [Jas+18]. All applications of
AD demonstrated for the SIMPLE algorithm in the later sections are feasible for those types of solvers
as well [STN].


2.4 Mesh & Field Conventions 


2.4.1 Mesh Topology 


OpenFOAM uses the FVM approach, as outlined in Section 2.2, to discretize PDEs. The
computational domain is discretized into finitely small volumes, also called cells. We define the
following terminology to construct the cells from basic entities.





• Let $\Omega \subset \mathbb{R}^3$ be the computational domain and $\Gamma = \partial\Omega$ the boundary of that domain.

• Let $Q$ be a set of distinct points $Q = \{ q \in (\Omega \cup \Gamma) \}$ (the obvious symbol choice of $p$ is
  already taken by the pressure).

• Let $E$ be a set of edges, each connecting two points, $E = \{ (q_i, q_j) \mid q_i, q_j \in Q, i \neq j \}$.

• Let $F$ be a set of faces. A face is made up of an edge cycle containing at least three edges.
  The cycle is defined as a path $((q_0, q_1), (q_1, q_2), \ldots, (q_n, q_0))$, where the start and endpoint
  of the path are identical and each other point is unique. All points $q_i$ of a face lie on a
  common plane in 3D space, that is, the face is planar.

• Let $F_\Gamma$ be the subset of $F$ containing all boundary faces. A boundary face is a face of which
  all points indexed by its edges lie in $\Gamma$.

• Let $F_P$ be a set of subsets of faces called patches, $F_P = \{ F_{P_i} \subset F_\Gamma \}$. Boundary conditions
  can be applied per patch. Thus the main use for patches is to group boundary faces to
  which a boundary condition should be applied.

• Let $F_\Omega$ be the set of interior faces $F \setminus F_\Gamma$. An interior face includes at least one point which
  lies in $\Omega$.

• Let $C$ be a set of cells. A cell $c \in C$ is a space $c \subset \Omega$ which is enclosed by at least four
  faces without any gaps. Every edge of a cell's face is shared by another face of that cell and
  potentially also by faces of other cells.

• Let the number of cells $|C|$ be denoted as $n_C$, the number of interior faces $|F_\Omega|$ as $n_F$, and
  the number of boundary faces $|F_\Gamma|$ as $n_{F_\Gamma}$.


Definition 1 (Volume centerpoint).
We define the centerpoint $q_c \in \mathbb{R}^3$ of a cell $c \in C$ as the point (centroid) which coincides with
the arithmetic mean of all points within the volume. For a formal definition see e.g. [Cox+69].

Figure 2.9: Unit cube meshed with 64 hexahedral elements. Some cells are made translucent to
show the interior cells and faces. Three instances of points, edges, faces, face normals, patches,
and cells are marked.


Definition 2 (Face centerpoint).
Similarly we define the centerpoint $q_f \in \mathbb{R}^3$ of a face $f \in F$ as the point (centroid) which coincides
with the arithmetic mean of all points on the face. For a formal definition see e.g. [Cox+69].


In practice the face centroid of complex shapes is computed by decomposing the face into triangles, 
calculating the areas and centroids of each triangle, and calculating the final centroid from the 
triangle centroids weighted by their area. Assuming uniform density throughout the domain, the 
face/volume centerpoints coincide with the center of mass. 
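
The triangle decomposition described above can be sketched as follows. This is a minimal illustration, assuming a convex planar face given as an ordered list of 3D points; building the triangle fan around the average of the face points is an assumption, and the function names are hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def face_centroid_and_area(points):
    """Area-weighted centroid of a convex planar polygon via a triangle fan."""
    n = len(points)
    # Fan center: plain average of the vertices (in general not the centroid).
    estimate = tuple(sum(p[k] for p in points) / n for k in range(3))
    total_area = 0.0
    weighted = [0.0, 0.0, 0.0]
    for i in range(n):
        a, b = points[i], points[(i + 1) % n]
        area = 0.5 * norm(cross(sub(b, a), sub(estimate, a)))
        tri_centroid = tuple((a[k] + b[k] + estimate[k]) / 3.0 for k in range(3))
        total_area += area
        for k in range(3):
            weighted[k] += area * tri_centroid[k]
    return tuple(w / total_area for w in weighted), total_area


# Right triangle with an extra point on one edge: the vertex average (1.5, 0.75, 0)
# differs from the true centroid (4/3, 1, 0).
face = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
centroid, area = face_centroid_and_area(face)
```

The example face is chosen such that the plain vertex average and the area-weighted centroid differ, which is why the weighting matters.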


Definition 3 (Face normal vector).
The face normal vector $n_f \in \mathbb{R}^3$ is defined as the vector that is perpendicular to the plane
containing all points of the planar face $f \in F$. By definition it points outside of the domain
enclosed by the face, and is normalized to length one, that is $\|n_f\| = 1$.


Definition 4 (Face area vector).
The face area vector $s_f \in \mathbb{R}^3$ is defined as the product of the face normal vector with the face
area $A_f$: $s_f = A_f n_f$, that is $\|s_f\| = A_f$.





The face area vector s can be used to efficiently calculate fluxes through a cell face. 
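
As an illustration of Definitions 3 and 4, the face area vector of a planar polygon can be computed from its points, and the normal and area follow from it. This is a minimal sketch with hypothetical function names; it assumes the points are ordered such that the right-hand rule yields the desired (outward) direction.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def face_area_vector(points):
    """Area vector of a planar polygon: half the sum of cross products of consecutive points."""
    total = [0.0, 0.0, 0.0]
    n = len(points)
    for i in range(n):
        c = cross(points[i], points[(i + 1) % n])
        for k in range(3):
            total[k] += c[k]
    return tuple(0.5 * t for t in total)


# Rectangular 2 x 3 face in the y-z plane.
face = [(0, 0, 0), (0, 2, 0), (0, 2, 3), (0, 0, 3)]
s_f = face_area_vector(face)
A_f = math.sqrt(dot(s_f, s_f))       # face area
n_f = tuple(x / A_f for x in s_f)    # unit normal
flux = dot((2.0, 1.0, 0.0), s_f)     # volumetric flux of a uniform velocity
```

Only the velocity component normal to the face contributes to the flux.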

Figure 2.9 demonstrates a very simple structured mesh of the unit cube. The points are spaced
equidistantly in the Cartesian directions at a distance of 1/4. Consequently the mesh consists
of $5^3 = 125$ points. The points are connected by 300 edges to form 240 faces. Of those faces,
$4 \cdot 4 \cdot 6 = 96$ are boundary faces, which can be naturally assigned to six boundary patches coinciding
with the sides of the cube. Each face can only be part of one patch; however, it can be assigned to
an arbitrary patch, depending on the required boundary conditions (which are set on a per-patch
level). Finally, the faces form $4^3 = 64$ hexahedral (cube) cells.
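
The entity counts quoted for the unit cube follow from simple combinatorics for a structured mesh with n cells per direction. The helper below is a hypothetical sketch for checking such numbers; it is not part of OpenFOAM.

```python
def structured_cube_counts(n):
    """Entity counts of a cube meshed with n x n x n hexahedral cells."""
    points = (n + 1) ** 3
    edges = 3 * n * (n + 1) ** 2        # n edges per grid line, (n+1)^2 lines per direction
    faces = 3 * n * n * (n + 1)         # n^2 faces per grid plane, (n+1) planes per direction
    boundary_faces = 6 * n * n
    return {
        "points": points,
        "edges": edges,
        "faces": faces,
        "boundary_faces": boundary_faces,
        "interior_faces": faces - boundary_faces,
        "cells": n ** 3,
    }


counts = structured_cube_counts(4)
```

For n = 4 this reproduces the numbers of the example mesh and additionally gives 144 interior faces.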


2.4.2 OpenFOAM Field Conventions


Table 2.1 lists the OpenFOAM fields corresponding to the most important physical quantities for
flows. The correlation between the notation for a physical quantity and the corresponding discretized
field is straightforward for all fields except the velocity. As the velocity $u \in \mathbb{R}^3$ is already a vector
quantity, we define the velocity field as
$U = (u_{0,x}, u_{0,y}, u_{0,z}, \ldots, u_{n_C-1,x}, u_{n_C-1,y}, u_{n_C-1,z}) \in \mathbb{R}^{3 n_C}$.
If the field of an individual velocity dimension is required, we denote it as $U_x, U_y, U_z \in \mathbb{R}^{n_C}$
respectively. This notation makes it possible to discern between the physical quantity and the discretized field,
and is also consistent with the OpenFOAM notation for the velocity field, which is U.





Table 2.1: Relation of physical quantities and their corresponding OpenFOAM fields for
compressible and incompressible flows.


Physical quantity        Symbol  Field  Name     Data type           Unit (incompr.)  Unit (compr.)
Velocity                 u       U      U        volVectorField      m/s              m/s
Pressure                 p       p      p        volScalarField      m²/s²            kg/(m s²)
Mass flux                φ       φ      phi      surfaceScalarField  m³/s             kg/s
Turb. kinetic energy     k       k      k        volScalarField      m²/s²            m²/s²
Turb. dissipation rate   ε       ε      epsilon  volScalarField      m²/s³            m²/s³


For incompressible flows, OpenFOAM scales the pressure with the inverse of the constant density
$\rho$, making the density disappear from the momentum equations (2.3). The unit of this scaled
pressure, referred to as kinematic pressure, is consequently m²/s² instead of the more common
kg/(m s²). Also, the face flux is calculated as $u \cdot s$ for the incompressible case, while for the
compressible case $\rho (u \cdot s)$ is used.

Physical quantities in OpenFOAM have to be defined in a consistent unit system, and the arithmetic
compatibility of different fields is checked at run time. While different unit systems, such as
imperial units, are possible, SI units are most commonly used. To enable the run-time check of
units, the dimensions of a field have to be specified in its input dictionary by a vector specifying
the exponents of the individual base units in a fixed order. The order of the unit specification is
shown in Table 2.2. Listing 2.3 gives an example for a velocity field with units length per time.








FoamFile{type volVectorField;}
dimensions      [0 1 -1 0 0 0 0];
internalField   uniform (1 0 0);


Listing 2.3: Definition of a field storing vectors with unit length over time, i.e. velocities.
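
The idea behind the run-time unit check can be mimicked with plain tuples of exponents. This is a sketch of the concept only and does not use the OpenFOAM implementation; the function names are hypothetical, and the exponent order (mass, length, time, temperature, quantity, current, luminous intensity) is the one used in the listing above.

```python
def multiply(a, b):
    """Multiplying two quantities adds their unit exponents."""
    return tuple(x + y for x, y in zip(a, b))


def divide(a, b):
    """Dividing two quantities subtracts their unit exponents."""
    return tuple(x - y for x, y in zip(a, b))


def check_addable(a, b):
    """Addition is only meaningful for identical dimensions."""
    if a != b:
        raise ValueError(f"incompatible dimensions: {a} vs {b}")
    return a


VELOCITY = (0, 1, -1, 0, 0, 0, 0)   # m/s
AREA     = (0, 2, 0, 0, 0, 0, 0)    # m^2
PRESSURE = (1, -1, -2, 0, 0, 0, 0)  # kg/(m s^2)
DENSITY  = (1, -3, 0, 0, 0, 0, 0)   # kg/m^3

kinematic_pressure = divide(PRESSURE, DENSITY)     # pressure scaled by density
volumetric_flux = multiply(VELOCITY, AREA)         # u . s
mass_flux = multiply(DENSITY, volumetric_flux)     # rho (u . s)
```

Attempting to add a velocity to a pressure, for example, is rejected by `check_addable`.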

Table 2.2: Order of units in the OpenFOAM unit system and corresponding SI unit [OF18].


No. Property SI unit 
1 Mass Kilogram (kg) 
2 Length Metre (m) 
3 Time Second (s) 
4 Temperature Kelvin (K) 
5 Quantity Mole (mol) 
6 Current Ampere (A) 
7 Luminous intensity Candela (cd) 


2.5 Continuous Adjoints 


In the context of solving PDEs, the application of AD leads to a discretize first — differentiate 
later approach. That is the governing (primal) equations are first transformed from a continuous 
problem to a discrete one (defined on finitely small control volumes), optionally linearized and 
then solved. The derivatives of the governing equations are obtained by applying AD to the 
implementation of the discretization process. AD can be introduced at different levels of the 
discretization process, ranging from differentiating the whole discretization and solution process 
(discussed in Section 3.2.2) to only differentiating the calculation of residuals (see Section 4.5) 
and supplying additional analytical insight to obtain the full derivatives [Lot16]. We call this the
discrete adjoint approach.

In contrast, the continuous adjoint approach yields a differentiate first — discretize later 
setting. That is from the primal PDEs a corresponding set of adjoint PDEs, along with adjoint 
boundary conditions, is derived symbolically (usually by variational calculus). The resulting 
adjoint PDEs are discretized and solved separately from the primal equations (however primal 
variables may appear in the adjoint equation, resulting in a coupling between primal and adjoint 
equations). Therefore, the discretization of the adjoint equations can be tailored to the physical 
properties of the adjoint, e.g. by employing upwinding of the adjoint convection equation. In the 
context of the adjoint Navier-Stokes equations, the adjoint convection direction is opposed to the 
primal (different sign, see derivation in following sections). A visual comparison of the different 
approaches to obtain the final sensitivities is given in Figure 2.10. 

A drawback of the continuous adjoint method is that the obtained derivatives are not necessarily
consistent with the primal, as implemented. For some PDEs, the adjoint equations are ill-conditioned,
leading to bad convergence of the solution algorithms. Also, the derivation process of the adjoint
equations is complex and error prone. For complex equations, e.g. turbulence models, a closed symbolic
derivation might not be possible at all [Car+10] without changes to the primal equations or
boundary conditions [Kav+15].

While this thesis is focused on obtaining derivatives using AD, we will briefly introduce the 
continuous adjoint equations, specifically to obtain topology sensitivities. We will use these 
methods to verify the AD implementation in later sections. This introduction is based on the 
derivations in [Oth08]. 

The solution of the equations listed above, including topology optimization using fixed step size
steepest descent, is implemented in the standard OpenFOAM solver adjointShapeOptimizationFoam
[OVW07]. A comparison between the continuous and a corresponding discrete adjoint solver
can be found in Section 4.3.3.

Figure 2.10: Comparison of solution procedure for the discrete and continuous adjoint. The
continuous adjoint differentiates on the differential equation level and then discretizes and
solves the obtained adjoint equations. In contrast the discrete adjoint discretizes the primal
equations first and then obtains the adjoints by applying AD to its implementation. Solid
arrows connect the building blocks of the continuous adjoint, dashed lines the ones of the
discrete adjoint.


2.5.1 Derivation of the Topological Sensitivity 


An optimization problem in CFD can be stated as the task of minimizing an objective $J$, e.g.
pressure loss in a system, under the constraint that the physical laws of fluid flow are fulfilled.
Introducing the residual vector $r = (r_1, r_2, r_3, r_4)$ as the residual of the Navier-Stokes momentum
and mass conservation equations, we can state the (not yet discretized) optimization problem as:

\[
\begin{aligned}
&\text{minimize}   && J(\alpha, u, p) \\
&\text{subject to} && r(\alpha, u, p) = 0 .
\end{aligned}
\tag{2.8}
\]

In residual form and neglecting external body forces the Navier-Stokes equations read as

\[
\begin{aligned}
(r_1, r_2, r_3)^T &= (u \cdot \nabla) u + \nabla p - \nu \nabla^2 u + \alpha u \\
r_4 &= -\nabla \cdot u .
\end{aligned}
\]
Constrained optimization problems can be reformulated into algebraic equations without constraints
by introducing Lagrange multipliers. Lagrange multipliers are additional unknown variables
which are introduced for every constraint equation.





This allows the general optimization problem of minimizing $f(x)$ under the constraint $g(x) = 0$ to be reformulated from

\[
\begin{aligned}
&\underset{x}{\text{minimize}} && f(x) \\
&\text{subject to}             && g(x) = 0
\end{aligned}
\tag{2.9}
\]

to the transformed problem

\[ \Lambda(x, \lambda) = f(x) + \lambda^T g(x) , \]

by introducing the Lagrange multiplier $\lambda$.
A solution $(x, \lambda)$, for which the derivatives of $\Lambda$ w.r.t. both $x$ and $\lambda$ vanish, is a candidate for
the solution of the constrained optimization problem (2.9):

\[ \frac{\partial \Lambda(x, \lambda)}{\partial x} = 0 , \qquad \frac{\partial \Lambda(x, \lambda)}{\partial \lambda} = 0 . \]

By applying the Lagrangian multiplier approach to the constrained topology optimization
problem (2.8), a modified cost function $\mathcal{L}$ is obtained:

\[ \mathcal{L} = J + \int_\Omega (\hat{u}, \hat{p}) \cdot r \, d\Omega . \]


Here the Lagrange multiplier is defined as $(\hat{u}, \hat{p}) = (\hat{u}_x, \hat{u}_y, \hat{u}_z, \hat{p})$ and ensures that the residual $r$
vanishes at each location inside the computational domain $\Omega$. The multiplier $\hat{u}$ is called adjoint
velocity, the multiplier $\hat{p}$ adjoint pressure.
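
To make the stationarity conditions of a Lagrange function concrete, consider a small finite-dimensional problem that is not part of the original derivation: minimize $f(x) = x_1^2 + x_2^2$ subject to $g(x) = x_1 + x_2 - 1 = 0$. The symbols $x_1$, $x_2$ and the scalar multiplier $\lambda$ are chosen for this illustration only.

```latex
\begin{aligned}
\Lambda(x, \lambda) &= x_1^2 + x_2^2 + \lambda \, (x_1 + x_2 - 1) \\
\frac{\partial \Lambda}{\partial x_1} &= 2 x_1 + \lambda = 0 \\
\frac{\partial \Lambda}{\partial x_2} &= 2 x_2 + \lambda = 0 \\
\frac{\partial \Lambda}{\partial \lambda} &= x_1 + x_2 - 1 = 0 \\
\Rightarrow \quad x_1 &= x_2 = \tfrac{1}{2}, \qquad \lambda = -1, \qquad f(x) = \tfrac{1}{2}
\end{aligned}
```

The derivative with respect to the multiplier simply recovers the constraint, which mirrors the role of the residual in the continuous setting.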

By applying variational calculus and assuming some restrictions on the feasible cost functions,
the following relation for obtaining the desired sensitivities $\partial \mathcal{L} / \partial \alpha$ can be found [Oth08]:

\[ \frac{\partial \mathcal{L}}{\partial \alpha_i} = (\hat{u}_i \cdot u_i) \, V_i . \]
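
Evaluated per cell, the sensitivity relation is a dot product of adjoint and primal velocity scaled by a cell-wise factor, which the sketch below takes to be the cell volume (an assumption). The fixed step size steepest descent update and the clipping of the porosity to an interval are likewise illustrative assumptions, and all names and numbers are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topology_sensitivities(u_adjoint, u_primal, volumes):
    """Per-cell sensitivity: (adjoint velocity . primal velocity) * cell volume."""
    return [dot(ua, up) * v for ua, up, v in zip(u_adjoint, u_primal, volumes)]


def steepest_descent_step(alpha, sensitivities, step, alpha_max):
    """Fixed step size update of the porosity, clipped to [0, alpha_max] (assumption)."""
    return [min(max(a - step * s, 0.0), alpha_max)
            for a, s in zip(alpha, sensitivities)]


# Two cells with made-up primal and adjoint velocities.
u_primal = [(2.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
u_adjoint = [(1.0, 0.0, 0.0), (-0.5, 0.0, 0.0)]
volumes = [0.5, 0.5]

sens = topology_sensitivities(u_adjoint, u_primal, volumes)
alpha_new = steepest_descent_step([1.0, 1.0], sens, step=0.5, alpha_max=10.0)
```

A positive sensitivity lowers the porosity in that cell, a negative one raises it.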


For this relation to hold, the Lagrange multipliers must be specifically chosen to ensure that
the variations of $\mathcal{L}$ with respect to $u$ and $p$ vanish. The values for the adjoint multipliers $\hat{u}$ and $\hat{p}$
can be found by solving the following additional PDEs:

\[
\begin{aligned}
-\left( \nabla \hat{u} + (\nabla \hat{u})^T \right) u &= \nu \nabla^2 \hat{u} - \nabla \hat{p} - \alpha \hat{u} \\
\nabla \cdot \hat{u} &= 0 .
\end{aligned}
\]

Comparing to the primal Navier-Stokes equations, the convection of the adjoint velocity in the
opposite direction of the primal flow can be clearly seen in the negative sign of the convective
term $\left( \nabla \hat{u} + (\nabla \hat{u})^T \right) u$. The equations are linear in $\hat{u}$; however, the matrix vector product (dot
product on a per equation level) $(\nabla \hat{u})^T u$ introduces mixed terms, which require an iterative solution
if a segregated solver is used. That is, the convection is discretized as $\left( \nabla \hat{u}^{i+1} + (\nabla \hat{u}^{i})^T \right) u$.


2.5.2 Derivation of Adjoint Boundary Conditions 


The boundary conditions for the continuous equations depend on the chosen cost function $J$ and
have to be individually derived for each desired cost function. The boundary conditions for
use with ducted flows and total power loss, as defined in (2.10), are given in Table 2.3.

\[ J = -\int_\Gamma \left( p + \frac{1}{2} \|u\|^2 \right) u \cdot n \, d\Gamma \tag{2.10} \]
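
            Inlet                                Wall                               Outlet
$u$         $u = u_{\mathrm{in}}$                $u_n = 0$, $u_t = 0$               $\nabla u \cdot n = 0$
$p$         $\nabla p \cdot n = 0$               $\nabla p \cdot n = 0$             $p = p_{\mathrm{outlet}}$
$\hat{u}$   $\hat{u}_t = 0$, $\hat{u}_n = u_n$   $\hat{u}_t = 0$, $\hat{u}_n = 0$   $u_n (\hat{u}_t - u_t) + \nu (n \cdot \nabla) \hat{u}_t = 0$
$\hat{p}$   $\nabla \hat{p} \cdot n = 0$         $\nabla \hat{p} \cdot n = 0$       $\hat{p} = \hat{u} \cdot u + \hat{u}_n u_n + \nu (n \cdot \nabla) \hat{u}_n - \frac{1}{2} u^2 - u_n^2$


Table 2.3: Primal and adjoint boundary conditions for inlet, walls, and outlet in ducted flows.
Primal/adjoint velocities normal/tangential to the patches are denoted by $u_n = u \cdot n$, $u_t = u \cdot t$,
$\hat{u}_n = \hat{u} \cdot n$, $\hat{u}_t = \hat{u} \cdot t$.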


2.5.3 Continuous Shape Sensitivity


By making an assumption connecting the adjoints to shear forces, the adjoint momentum equations
can also be used to derive a continuous optimization procedure for surface shapes [Oth08]. For the
detailed derivation of those relations, consult [Oth08]. More advanced derivations for different cost
functions, as well as turbulence models, are found in a variety of publications by Giannakoglou et al.,
e.g. [Zym+09]. For a specific amount of outward facing movement in surface normal direction $\beta$
at an arbitrary point on the surface, the sensitivity can be computed as

\[ \frac{d\mathcal{L}}{d\beta} = -A \nu \, (n \cdot \nabla) \hat{u}_t \cdot (n \cdot \nabla) u_t , \tag{2.11} \]

with $A$ representing the surface area affected by the movement of the surface by $\beta$. The shape
sensitivities thus depend on the gradients in normal direction of the primal and adjoint velocities
tangential to the walls.

Shape sensitivities give an indication of which nodes have to be moved inward or outward with respect
to their face normals. A comparison between this approach and a discrete adjoint solver which
directly uses the individual points of the mesh as parameters is shown in Section 4.4.


2.6 Sparse Matrix Storage 


2.6.1 General Sparse Storage Schemes 


In a wide variety of (technical) applications, the matrices obtained by the discretization of
nonlinear PDEs are sparse. For the FVM, the matrices are sparse because the stencils used to
approximate the spatial and temporal derivatives are only influenced by a limited number of
values in the direct neighborhood of the derivation point. A matrix is called sparse if the number
of non-zero elements is low compared to the number of zeros in the matrix. That is, for a
matrix $A = (a_{ij}) \in \mathbb{R}^{m \times n}$:

\[ \frac{n_{nz}}{m \, n} \ll 1 , \]

where $n_{nz} = |\{ a_{ij} \mid a_{ij} \neq 0 \}|$ is the number of non-zero elements in the matrix $A$.

One distinguishes between structurally zero elements of a matrix and zero valued elements, with
the former being a subset of the latter. For a matrix which is generated by some fixed algorithm
(code), structurally zero elements are elements which are zero regardless of the values used to
calculate them (as long as changes in the inputs do not induce changes in the control flow of the
algorithm). Additional zero valued elements can be created through numerical operations such
as multiplication by zero. The sparsity pattern, that is the positions of all structural non-zero
elements, of Jacobians/Hessians can be obtained by the propagation of pattern sets [Var11]. The
determination of sparsity patterns in OpenFOAM is presented in detail in Section 4.5.2.

Exploiting the sparsity of such matrices is beneficial for a multitude of reasons (in the following
we assume square matrices, or at least $O(m) = O(n)$):

• Storage of the dense matrix is costly at $O(n^2)$; it is not uncommon for the dense matrix to
  not fit into memory at all. With sparse storage schemes (see below) the memory can be
  reduced to $O(n_{nz})$.

• The multiplication of a sparse matrix with a dense vector can be sped up from $O(n^2)$ to
  $O(n_{nz})$ by eliminating memory access and operations on zero entries.

• The direct solution of linear equation systems (e.g. using LU decomposition) is sped up by
  exploiting sparsity.





Sparse matrix storage schemes aim to reduce the memory required to store the matrix coefficients,
while retaining all information of the matrix and providing efficient access routines to the individual
entries. Popular choices for sparse matrix storage include the compressed row storage (CRS) scheme
and the coordinate format.

We will illustrate both schemes using an example. Let the matrix $A \in \mathbb{R}^{4 \times 4}$ with $n_{nz} = 8$
individual (structural) non-zero matrix entries $a_{ij}$ be defined as

\[
A = \begin{pmatrix}
a_{00} & a_{01} & 0      & 0      \\
0      & a_{11} & a_{12} & 0      \\
a_{20} & 0      & a_{22} & 0      \\
a_{30} & 0      & 0      & a_{33}
\end{pmatrix} .
\]


The CRS scheme stores the non-zero elements of the matrix in left-to-right and top-to-bottom
order in vector $v$ (row-wise storage). Consequently, entries of the same row are always adjacent.
The corresponding column indices of the non-zero elements are stored in vector $c_i$. The vector $r_i$
holds for each row the position of the first non-zero entry in this row in the vectors $v$ and $c_i$.
Thus, the memory requirement depends on the number of non-zeros and the number of rows:
$\mathrm{MEM} = n_{nz} \cdot \mathrm{sizeof}(\mathrm{value\_type}) + (n_{nz} + m) \cdot \mathrm{sizeof}(\mathrm{index\_type})$.

\[
\begin{aligned}
v   &= [a_{00}, a_{01}, a_{11}, a_{12}, a_{20}, a_{22}, a_{30}, a_{33}] \\
c_i &= [0, 1, 1, 2, 0, 2, 0, 3] \\
r_i &= [0, 2, 4, 6] .
\end{aligned}
\]
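
The CRS layout of the example matrix can be reproduced with a short sketch. The numeric entries 1 to 8 stand in for the symbolic coefficients, every zero value is treated as a structural zero, and the function names are hypothetical.

```python
def to_crs(dense):
    """Convert a dense row-major matrix into the vectors (v, c_i, r_i)."""
    v, c, r = [], [], []
    for row in dense:
        r.append(len(v))          # position of the first entry of this row
        for j, a in enumerate(row):
            if a != 0:
                v.append(a)
                c.append(j)
    return v, c, r


def row_range(r, nnz, i):
    """Positions in v and c_i that belong to row i."""
    end = r[i + 1] if i + 1 < len(r) else nnz
    return range(r[i], end)


def crs_matvec(v, c, r, x):
    """Sparse matrix vector product touching only the stored entries."""
    return [sum(v[k] * x[c[k]] for k in row_range(r, len(v), i))
            for i in range(len(r))]


def crs_get(v, c, r, i, j):
    """Retrieving a single entry requires a search within row i."""
    for k in row_range(r, len(v), i):
        if c[k] == j:
            return v[k]
    return 0


A = [[1, 2, 0, 0],
     [0, 3, 4, 0],
     [5, 0, 6, 0],
     [7, 0, 0, 8]]
v, c, r = to_crs(A)
y = crs_matvec(v, c, r, [1, 1, 1, 1])
```

The search in `crs_get` illustrates why access to individual entries is less direct than in a dense matrix.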





The coordinate format stores the non-zero elements in vector $v$, the row indices of the non-zero
elements in vector $r_i$ and the column indices in $c_i$. Consequently, the memory requirement is
$n_{nz} \cdot \mathrm{sizeof}(\mathrm{value\_type}) + 2 \cdot n_{nz} \cdot \mathrm{sizeof}(\mathrm{index\_type})$.

\[
\begin{aligned}
v   &= [a_{00}, a_{01}, a_{11}, a_{12}, a_{20}, a_{22}, a_{30}, a_{33}] \\
r_i &= [0, 0, 1, 1, 2, 2, 3, 3] \\
c_i &= [0, 1, 1, 2, 0, 2, 0, 3] .
\end{aligned}
\]

The coordinate format is not unique, in that the entries in $v$ can be stored in arbitrary order. In
an implementation it is advisable to store the entries of $v$ in a cache efficient order. I.e., if matrix
vector products are required, $v$ should be stored row-wise, such that subsequent entries can be
efficiently read from cache.

For $n_{nz} > m$ (for practical applications usually $n_{nz} \gg m$), the CRS scheme occupies less
memory than the coordinate format. The adjacency of the non-zero entries of a row and the
explicit availability of the column indices make the CRS format well suited for performing sparse
matrix vector multiplications. However, it adds a layer of complexity to the retrieval of individual
matrix entries. A matrix stored in CRS format cannot be trivially transposed. A matrix stored in
coordinate format can be transposed by simply switching the row and column vectors $r_i$ and $c_i$.
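
The transposition by swapping index vectors and the memory comparison can be checked with a small sketch. It reuses numeric stand-ins 1 to 8 for the entries of the example matrix; the function names are hypothetical, and the sizes of 8 bytes per value and 4 bytes per index are assumptions.

```python
def coo_matvec(v, r, c, x, m):
    """Matrix vector product for a matrix stored in coordinate format."""
    y = [0] * m
    for a, i, j in zip(v, r, c):
        y[i] += a * x[j]
    return y


def coo_transpose(v, r, c):
    """Transposition only swaps the roles of the row and column index vectors."""
    return v, c, r


def crs_bytes(nnz, m, value_size=8, index_size=4):
    return nnz * value_size + (nnz + m) * index_size


def coo_bytes(nnz, value_size=8, index_size=4):
    return nnz * value_size + 2 * nnz * index_size


def dense_bytes(n, value_size=8):
    return n * n * value_size


v = [1, 2, 3, 4, 5, 6, 7, 8]
r = [0, 0, 1, 1, 2, 2, 3, 3]
c = [0, 1, 1, 2, 0, 2, 0, 3]

y = coo_matvec(v, r, c, [1, 1, 1, 1], 4)
vt, rt, ct = coo_transpose(v, r, c)
y_transposed = coo_matvec(vt, rt, ct, [1, 1, 1, 1], 4)
```

With the matrix size and number of non-zeros quoted for the motorbike case, `crs_bytes(2460120, 352570)` is about 31 MB and `coo_bytes(2460120)` about 39 MB, whereas `dense_bytes(352570)` is close to one terabyte.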





2.6.2 OpenFOAM LDU Format 


OpenFOAM uses a variation of the coordinate format to store its matrix coefficients. The specifics
of the storage scheme will become important when symbolic adjoints are applied to the embedded
linear solvers. In the context of CFD simulation, the diagonal of the FVM discretization matrix
is always dense, as the discretization using finite volumes always yields a central coefficient.
Furthermore, the discretization matrix $A \in \mathbb{R}^{n_C \times n_C}$ is structurally symmetric and square.
The diagonal coefficients of the matrix are stored in a dense vector $\boldsymbol{d} = [a_{ii} \mid 0 \le i < n_C]$. A
diagonal entry $a_{ii}$ can be looked up directly by accessing $d_i$; thus no additional row and column
indices need to be stored. All other non-zero entries are stored in vectors $\boldsymbol{l}$ and $\boldsymbol{u}$, where
$\boldsymbol{l} = [a_{ij} \mid a_{ij} \neq 0, i > j]$ are the coefficients below the diagonal and $\boldsymbol{u} = [a_{ij} \mid a_{ij} \neq 0, i < j]$ are
the coefficients above the diagonal. In the implementation, the vectors $\boldsymbol{l}$, $\boldsymbol{d}$, and $\boldsymbol{u}$ can be obtained
by calling the access functions lower(), diag() and upper() respectively. Note that, to resolve
the symbol clash between velocity $u$ and upper entries $\boldsymbol{u}$, all vectors corresponding to the LDU
format are typeset in bold italics.

For symmetric matrices, only one of the vectors $\boldsymbol{l}$ and $\boldsymbol{u}$ needs to be stored. In addition to
the non-zero values stored in $\boldsymbol{l}$ and/or $\boldsymbol{u}$, the row and column indices of the matrix entries need
to be stored. The indices are stored in addressing arrays $L \in \mathbb{N}^{n_F}$ and $U \in \mathbb{N}^{n_F}$. The indices
are ordered such that $L$ is monotonically increasing and subsets of $U$ with identical $L$ are also
monotonically increasing (see following example). This ensures good caching performance for the
evaluation of matrix vector products. The addressing arrays can be obtained from the lduMatrix
class by calling lowerAddr() and upperAddr() respectively.

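Due to the structural symmetry of the finite volume discretization, entries of the lower part are
given by

\[ a_{U_i, L_i} = l_i \]

and the entries of the upper part by

\[ a_{L_i, U_i} = u_i . \]

As an example let

\[
A = \begin{pmatrix}
d_0 & u_0 & u_1 & 0   & 0   & 0   \\
l_0 & d_1 & u_2 & u_3 & 0   & 0   \\
l_1 & l_2 & d_2 & u_4 & u_5 & 0   \\
0   & l_3 & l_4 & d_3 & u_6 & u_7 \\
0   & 0   & l_5 & l_6 & d_4 & u_8 \\
0   & 0   & 0   & l_7 & l_8 & d_5
\end{pmatrix}
\]

be a $6 \times 6$ banded matrix storing 24 non-zero entries. Then the LDU format of $A$ is given by

\[
\begin{aligned}
\boldsymbol{l} &= [l_0 \; l_1 \; l_2 \; l_3 \; l_4 \; l_5 \; l_6 \; l_7 \; l_8] \\
\boldsymbol{d} &= [d_0 \; d_1 \; d_2 \; d_3 \; d_4 \; d_5] \\
\boldsymbol{u} &= [u_0 \; u_1 \; u_2 \; u_3 \; u_4 \; u_5 \; u_6 \; u_7 \; u_8] \\
L &= [0 \; 0 \; 1 \; 1 \; 2 \; 2 \; 3 \; 3 \; 4] \\
U &= [1 \; 2 \; 2 \; 3 \; 3 \; 4 \; 4 \; 5 \; 5] .
\end{aligned}
\]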
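
A matrix vector product in this storage scheme loops once over the diagonal and once over the off-diagonal pairs. The following is a minimal sketch of that idea in plain Python, not the OpenFOAM implementation; the numeric coefficient values are made up, and the function names are hypothetical.

```python
def ldu_matvec(l, d, u, L, U, x):
    """y = A x for a matrix given by lower, diagonal, upper and addressing arrays."""
    y = [d[i] * x[i] for i in range(len(d))]
    for k in range(len(L)):
        y[U[k]] += l[k] * x[L[k]]   # lower entry sits at (U[k], L[k])
        y[L[k]] += u[k] * x[U[k]]   # upper entry sits at (L[k], U[k])
    return y


def ldu_to_dense(l, d, u, L, U):
    """Expand the LDU storage into a dense matrix for checking."""
    n = len(d)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = d[i]
    for k in range(len(L)):
        A[U[k]][L[k]] = l[k]
        A[L[k]][U[k]] = u[k]
    return A


# Addressing of the 6 x 6 example; coefficient values are stand-ins.
L = [0, 0, 1, 1, 2, 2, 3, 3, 4]
U = [1, 2, 2, 3, 3, 4, 4, 5, 5]
d = [10, 11, 12, 13, 14, 15]
l = [1, 2, 3, 4, 5, 6, 7, 8, 9]
u = [-1, -2, -3, -4, -5, -6, -7, -8, -9]

x = [1, 1, 1, 1, 1, 1]
y = ldu_matvec(l, d, u, L, U, x)
A = ldu_to_dense(l, d, u, L, U)
y_dense = [sum(A[i][j] * x[j] for j in range(6)) for i in range(6)]
```

Each off-diagonal pair is visited exactly once, so the cost is proportional to the number of stored coefficients.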


For the solution of linear systems arising from the FVM discretization, boundary conditions need
to be applied to faces which only connect to one cell. The boundary coefficients are stored in
two additional vectors $B$ and $I$, called boundary components and internal components. The
coefficients in $I$ are the coefficients which correspond to the influence of the boundary conditions
onto the central coefficients of the matrix stored on the diagonal. In contrast to the $\boldsymbol{l}$, $\boldsymbol{d}$ and $\boldsymbol{u}$
coefficients, those coefficients are not necessarily identical for all dimensions of a fvVectorMatrix.
The coefficients of $I$ for the correct dimension are only added to $\boldsymbol{d}$ on demand, before calling the
linear system solver, and removed again after the solver has finished. The coefficients in $B$ can
be imagined as virtual cells outside of the computational domain, which arise from the boundary
conditions. They are added to the right-hand side of the equation system on demand.

For later reference, Listing 2.4 gives a truncated overview of the relevant lduMatrix.H and 
lduAddressing.H header files. 
class lduMatrix {
private:
    //- LDU mesh reference
    const lduMesh& lduMesh_;
    //- Coefficients (not including interfaces)
    scalarField *lowerPtr_, *diagPtr_, *upperPtr_;
    [...]
    //- Abstract base-class for lduMatrix solvers
    class solver {
    protected:
        const FieldField<Field, scalar>& interfaceBouCoeffs_;
        const FieldField<Field, scalar>& interfaceIntCoeffs_;
        lduInterfaceFieldPtrsList interfaces_;
        [...]
    public:
        const FieldField<Field, scalar>& interfaceBouCoeffs() const;
        const FieldField<Field, scalar>& interfaceIntCoeffs() const;
        const lduInterfaceFieldPtrsList& interfaces() const;
        [...]
    };
public:
    //- Return the LDU addressing, access to coefficients
    const lduAddressing& lduAddr() const;
    scalarField& lower();
    scalarField& diag();
    scalarField& upper();

    bool hasDiag() const;
    bool hasUpper() const;
    bool hasLower() const;
    bool diagonal() const;
    bool symmetric() const;

    //- Init the update of interfaced interfaces for matrix operations
    void initMatrixInterfaces([...]) const;

    //- Update interfaced interfaces for matrix operations
    void updateMatrixInterfaces([...]) const;
    [...]
};

class fvMeshLduAddressing : public lduAddressing {
private:
    labelList::subList lowerAddr_;
    const labelList& upperAddr_;
public:
    //- Return lower addressing (i.e. lower label = upper triangle)
    const labelUList& lowerAddr() const;

    //- Return upper addressing (i.e. upper label)
    const labelUList& upperAddr() const;
    [...]
};

Listing 2.4: Truncated LDU description of class lduMatrix.
2.7 Differentiation of Computer Programs


In this section the fundamental methods to obtain derivatives of arbitrary numerical code are 
presented. After the derivation of finite differences, first- and higher-order models of AD are 
introduced. 


2.7.1 Finite Differences 


Finite differences (FD) are a popular method to approximate the derivatives of uni- or multivariate
functions. FD introduces approximation errors and, when implemented with floating-point
precision, also additional numerical truncation and rounding errors.
The derivative of a continuous function $f : \mathbb{R} \to \mathbb{R}$ at the location $x_0$ is defined as

$$\frac{df}{dx}(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h} .$$

If this limit exists, the function is called differentiable at the location $x_0$. For finite values of h,
this definition can be used to approximate the derivative:

$$\frac{df}{dx}(x_0) \approx \frac{f(x_0 + h) - f(x_0)}{h} . \qquad (2.12)$$

This finite difference is called forward difference. Analogously the backward difference is defined as

$$\frac{df}{dx}(x_0) \approx \frac{f(x_0) - f(x_0 - h)}{h} .$$

Both approximations introduce an approximation error of $O(h)$, as will be shown in Theorem 1.
A more accurate approximation, scaling with $O(h^2)$, can be found at the cost of one additional
function evaluation (assuming that $f(x_0)$ is already known and therefore evaluation at $x_0$ does
not incur additional cost):

$$\frac{df}{dx}(x_0) \approx \frac{f(x_0 + h) - f(x_0 - h)}{2h} . \qquad (2.13)$$

The accuracy of the FD schemes can easily be proven using Taylor expansion, as shown in the
following three proofs. The convergence properties will be used in a later section to verify the
AD implementation.


Theorem 1 (Onesided FD).
Let f be a function $\mathbb{R} \to \mathbb{R}$ which is differentiable at all points in the interval $[x_0, x_0 + h]$. Then the
approximation error of the forward difference (2.12) scales with $O(h)$ for $h \to 0$.


Proof. The Taylor expansion of function f at $x = x_0 + h$, truncated after the third term of the
infinite sum, evaluates to

$$f(x_0 + h) = f(x_0) + h \frac{df}{dx}(x_0) + \frac{h^2}{2} \frac{d^2 f}{dx^2}(x_0) + O(h^3) .$$

Subtracting $f(x_0)$ on both sides of the equation and dividing by h gives

$$\frac{f(x_0 + h) - f(x_0)}{h} = \frac{df}{dx}(x_0) + \frac{h}{2} \frac{d^2 f}{dx^2}(x_0) + O(h^2)$$

$$\Rightarrow \quad \frac{df}{dx}(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} + O(h) . \qquad \square$$

The proof for the backward difference directly follows from the Taylor expansion

$$f(x_0 - h) = f(x_0) - h \frac{df}{dx}(x_0) + \frac{h^2}{2} \frac{d^2 f}{dx^2}(x_0) + O(h^3) .$$


Theorem 2 (Central FD).
Let f be a function $\mathbb{R} \to \mathbb{R}$ which is differentiable at all points in the interval $[x_0 - h, x_0 + h]$.
Then the approximation error of the central difference (2.13) scales with $O(h^2)$ for $h \to 0$.


Proof. The Taylor expansions of function f at $x = x_0 + h$ and $x = x_0 - h$, truncated after the
third term of the sum, evaluate to

$$\begin{aligned}
f(x_0 + h) &= f(x_0) + h \frac{df}{dx}(x_0) + \frac{h^2}{2} \frac{d^2 f}{dx^2}(x_0) + O(h^3) \\
f(x_0 - h) &= f(x_0) - h \frac{df}{dx}(x_0) + \frac{h^2}{2} \frac{d^2 f}{dx^2}(x_0) + O(h^3) .
\end{aligned}$$

Subtracting the second equation from the first, the second order term vanishes and leads to the
desired approximation:

$$f(x_0 + h) - f(x_0 - h) = 2h \frac{df}{dx}(x_0) + O(h^3)$$

$$\Rightarrow \quad \frac{df}{dx}(x_0) = \frac{f(x_0 + h) - f(x_0 - h)}{2h} + O(h^2) . \qquad \square$$


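The definition of the finite difference can be straightforwardly extended to the multivariate
case $f : \mathbb{R}^n \to \mathbb{R}$, giving approximations for the i-th directional derivative as

$$\begin{aligned}
\frac{\partial f}{\partial x_i}(x_0) &= \frac{f(x_0 + h e_i) - f(x_0)}{h} + O(h) \\
\frac{\partial f}{\partial x_i}(x_0) &= \frac{f(x_0) - f(x_0 - h e_i)}{h} + O(h) \\
\frac{\partial f}{\partial x_i}(x_0) &= \frac{f(x_0 + h e_i) - f(x_0 - h e_i)}{2h} + O(h^2) .
\end{aligned}$$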


FD can be applied to obtain second and higher order derivatives, by eliminating the first order 
derivative term from the Taylor expansion. This can be interpreted as reapplying the first order 
FD model to both evaluation points of the first order central difference. 


Theorem 3 (Second order FD).
The second derivative of a univariate function $f(x) : \mathbb{R} \to \mathbb{R}$, which is twice differentiable at all
points in the interval $[x_0 - h, x_0 + h]$, can be approximated as

$$\frac{d^2 f}{dx^2}(x_0) \approx \frac{f(x_0 + h) - 2 f(x_0) + f(x_0 - h)}{h^2} .$$

The approximation error scales with $O(h^2)$.

Proof. As before the Taylor series of $f(x_0 + h)$ and $f(x_0 - h)$ are

$$\begin{aligned}
f(x_0 + h) &= f(x_0) + h \frac{df}{dx}(x_0) + \frac{h^2}{2} \frac{d^2 f}{dx^2}(x_0) + \frac{h^3}{6} \frac{d^3 f}{dx^3}(x_0) + O(h^4) \\
f(x_0 - h) &= f(x_0) - h \frac{df}{dx}(x_0) + \frac{h^2}{2} \frac{d^2 f}{dx^2}(x_0) - \frac{h^3}{6} \frac{d^3 f}{dx^3}(x_0) + O(h^4) .
\end{aligned}$$

Adding both equations and subtracting $2 f(x_0)$ from both sides yields the desired result:

$$f(x_0 + h) + f(x_0 - h) - 2 f(x_0) = h^2 \frac{d^2 f}{dx^2}(x_0) + O(h^4)$$

$$\Rightarrow \quad \frac{d^2 f}{dx^2}(x_0) = \frac{f(x_0 + h) - 2 f(x_0) + f(x_0 - h)}{h^2} + O(h^2) . \qquad \square$$


FD allows the evaluation of the Jacobian of a multivariate function $f : \mathbb{R}^n \to \mathbb{R}^m$ at cost
$O(n) \cdot \mathrm{cost}(f)$, irrespective of the output dimension m. The truncation error can introduce numerical
noise. The step size h needs to be tuned such that it is small enough to give a reasonable
approximation, but large enough not to encounter numerical issues due to machine precision.
This is challenging, especially for multivariate functions with partial derivatives which differ in
order of magnitude. Advantages are the comparatively simple implementation and the ability to
differentiate complex models for which the source code need not be available.
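
The convergence orders stated in Theorems 1 and 2 can be checked numerically by halving the step size and comparing the errors. The following is a minimal sketch under that assumption; the names `forwardDiff`, `centralDiff` and `observedOrder` are hypothetical helpers introduced for illustration and do not refer to any code of this work.

```cpp
#include <cassert>
#include <cmath>

using Fn = double (*)(double);                // scalar function f(x)
using Scheme = double (*)(Fn, double, double);  // FD scheme(f, x0, h)

// Forward difference (2.12): truncation error O(h).
double forwardDiff(Fn f, double x0, double h) {
    assert(h > 0.0);
    return (f(x0 + h) - f(x0)) / h;
}

// Central difference (2.13): truncation error O(h^2).
double centralDiff(Fn f, double x0, double h) {
    assert(h > 0.0);
    return (f(x0 + h) - f(x0 - h)) / (2.0 * h);
}

// Observed convergence order from the errors at step sizes h and h/2:
// log2(error(h) / error(h/2)). dfExact is the analytic derivative.
double observedOrder(Scheme scheme, Fn f, Fn dfExact, double x0, double h) {
    const double exact = dfExact(x0);
    const double e1 = std::abs(scheme(f, x0, h) - exact);
    const double e2 = std::abs(scheme(f, x0, h / 2.0) - exact);
    return std::log2(e1 / e2);
}
```

For a smooth function and a moderate step size the observed order should be close to one for the forward and close to two for the central difference. For very small h the rounding error dominates and the observed order degrades, which is the tuning problem described above.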


2.7.2 Algorithmic Differentiation 


Algorithmic Differentiation [GW08; Nau12] (AD), sometimes also called Automatic Differentiation
[Bar+00], names the process of generating derivatives of a given (numerical) computer
program, calculating the sensitivity of one or several outputs w.r.t. a set of inputs. Conceptually
AD implements the evaluation of the chain rule on the sequence of operations connecting the
inputs to the outputs.

Let f(x) be a function which maps a vector to a scalar, $f : \mathbb{R}^n \to \mathbb{R}$, and which is at least twice
continuously differentiable ($C^2$). Then the gradient $\nabla f(x) : \mathbb{R}^n \to \mathbb{R}^n$ of that function is defined
as

$$\nabla f = \sum_{i=0}^{n-1} \frac{\partial f}{\partial x_i} e_i = \left( \frac{\partial f}{\partial x_0}, \ldots, \frac{\partial f}{\partial x_{n-1}} \right) ,$$

where $e_i$ denotes the i-th Cartesian unit vector.
The Hessian of f is given by the symmetric matrix $H \in \mathbb{R}^{n \times n}$ of second order partial derivatives,
where each entry $h_{ij}$ is given by:

$$h_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} .$$

Let $g(x) \in C^1$ be a different function which maps from a vector to a vector, $g : \mathbb{R}^n \to \mathbb{R}^m$. Then
the Jacobian J of g is an m x n matrix where each entry is given by:

$$j_{ij} = \frac{\partial g_i}{\partial x_j} .$$
As can be seen from the above definition, the gradient is the special case of a Jacobian with a
single row (m = 1).

AD allows the derivatives of functions implemented as computer programs to be evaluated with
machine precision accuracy. First order AD assumes that the function is differentiable at least once
at all points of interest. AD relies on the fact that each computer program can be decomposed
at run time into a single assignment code (SAC), which we define below.








Definition 5 (SAC).
Each computer program implementing numerical functions can be decomposed into a sequence of
elemental functions $\varphi_j$ and assignments, mapping n independent inputs to m dependent outputs
with p intermediate variables:

$$\text{for } j = n, \ldots, n+p+m-1: \quad v_j = \varphi_j(v_i)_{i \prec j} ,$$

where $i \prec j$ denotes a direct dependence of the variable $v_j$ on $v_i$. The result of each elemental
function $\varphi_j$ is assigned to a unique auxiliary variable $v_j$. The n independent inputs $x_i = v_i$,
for $i = 0, \ldots, n-1$, are mapped onto m dependent outputs $y_j = v_{n+p+j}$, for $j = 0, \ldots, m-1$.
The values of p intermediate variables $v_k$ are computed for $k = n, \ldots, n+p-1$. If not otherwise
specified we restrict the functions admissible as elemental functions to unary or binary functions,
limiting the number of arguments for each elemental function to at most two.


Definition 6 (DAG). 
A directed acyclic graph G = (V, E) is a directed graph which contains no cycles, i.e. there is no 
directed path starting and terminating at the same node [TS11]. 





A SAC can be conveniently represented as a DAG, where the $n + p + m$ nodes are the inputs,
outputs, and intermediate variables, uniquely defined by the elemental functions. The edges model
the dependence of the elemental functions on intermediate values and the inputs:

$$(i, j) \in E \iff \frac{\partial v_j}{\partial v_i} \not\equiv 0 .$$

We label the edges with the partial derivatives, to aid the calculation of the full derivatives by
executing the chain rule on the paths in the graph. An example program, its transformation to a
SAC, and the corresponding DAG are shown in Figure 2.11.

Tangent Mode of AD 


In this and the following definitions for AD models the notation from [Nau12] is used. To declutter
the indices, later also the notations from [GW08] are employed for first order derivative models.


Definition 7 (First order tangent model).
Let $f : \mathbb{R}^n \to \mathbb{R}^m$ with $y = f(x)$, $x, x^{(1)} \in \mathbb{R}^n$ and $y, y^{(1)} \in \mathbb{R}^m$. Then the first order tangent
model $f^{(1)} : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^m \times \mathbb{R}^m$ of f calculates the primal as well as the Jacobian, evaluated
in direction $x^{(1)}$:

$$\begin{pmatrix} y^{(1)} \\ y \end{pmatrix} = f^{(1)}(x^{(1)}, x) = \begin{pmatrix} \nabla f(x) \cdot x^{(1)} \\ f(x) \end{pmatrix} .$$

The superscript $\cdot^{(1)}$ denotes the first tangent direction. The motivation for this notation will
become apparent when higher derivatives are introduced. To make notation more compact, and
to broaden compatibility to existing literature [GW08], for first order tangents we define the
following equivalent notation:

$$\dot{x} \equiv x^{(1)} .$$

With the gradient in direction of a tangent $\dot{x}$, the full Jacobian can be assembled at cost
$O(n) \cdot \mathrm{cost}(f)$, by evaluating the tangent model with all unit vectors $\dot{x} = e_i \in \mathbb{R}^n$, giving one
column of the Jacobian for each evaluation. Here $O(n)$ includes the overhead of the tangent
model evaluation $f^{(1)}$, compared to a passive primal evaluation of f.

The tangent model is now applied to each assignment of the SAC. Assuming differentiability of
all elemental functions at their respective evaluation points, the tangent model of AD augments
each elemental assignment with its tangent as follows:

$$\text{for } j = n, \ldots, n+p+m-1: \quad
\begin{aligned}
\dot{v}_j &= \sum_{i \prec j} \frac{\partial \varphi_j}{\partial v_i} \cdot \dot{v}_i \\
v_j &= \varphi_j(v_i)_{i \prec j} .
\end{aligned} \qquad (2.14)$$

Here the variables $\dot{v}$ are the tangents associated with the primal values v. The directional
derivatives are evaluated alongside the primal elemental functions, propagating them from the
inputs to the outputs.
As an illustration the tangent model for the function

$$y = (x_0 \cdot x_1)(1 + x_1)$$

is shown in Figure 2.11. Transforming this function to a SAC gives the following sets of inputs,
output, and auxiliary variables: $n = |\{v_0, v_1\}| = 2$, $m = |\{v_4\}| = 1$, $p = |\{v_2, v_3\}| = 2$. This
SAC can alternatively be represented as the DAG shown in the figure. An auxiliary variable t is
inserted into the DAG. Its partial derivatives are defined such that the tangent model

$$\dot{v}_4 = \frac{d v_4}{dt} = \nabla v_4(v_0, v_1) \cdot [\dot{v}_0, \dot{v}_1]^T$$

is created by multiplying the partials along the paths connecting t to $v_4$. The SAC is transformed
according to the rules of Equation 2.14, resulting in a program augmented by the tangent
statements. The resulting calculation is shown on the right of the figure. It calculates the
tangent $\dot{y} = \dot{v}_4$ as well as the primal result $y = v_4$.
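
Written out by hand, the augmented SAC of the example may look as follows. This is only a sketch: the function name `tangentExample`, the struct `TangentResult` and the `Dot` suffix for tangents are naming assumptions made here, and the function is read as $y = (x_0 \cdot x_1)(1 + x_1)$ in accordance with the SAC $v_2 = v_0 v_1$, $v_3 = v_2 v_1$, $v_4 = v_3 + v_2$.

```cpp
// Primal result y and directional derivative yDot of the example function.
struct TangentResult {
    double y;
    double yDot;
};

// Augmented SAC of y = (x0 * x1) * (1 + x1): every primal assignment is
// preceded by its tangent statement, following Equation 2.14.
TangentResult tangentExample(double x0, double x1, double x0Dot, double x1Dot) {
    const double v0 = x0, v0Dot = x0Dot;  // inputs and their seeds
    const double v1 = x1, v1Dot = x1Dot;

    const double v2Dot = v1 * v0Dot + v0 * v1Dot;  // tangent of v2 = v0 * v1
    const double v2 = v0 * v1;

    const double v3Dot = v2 * v1Dot + v1 * v2Dot;  // tangent of v3 = v2 * v1
    const double v3 = v2 * v1;

    const double v4Dot = 1.0 * v3Dot + 1.0 * v2Dot;  // tangent of v4 = v3 + v2
    const double v4 = v3 + v2;

    return {v4, v4Dot};
}
```

Seeding `(x0Dot, x1Dot) = (1, 0)` yields the partial derivative with respect to $x_0$, seeding `(0, 1)` the one with respect to $x_1$; the full gradient therefore needs two evaluations.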


Adjoint Mode of AD 


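Definition 8 (First order adjoint model).
Let $f : \mathbb{R}^n \to \mathbb{R}^m$ with $y = f(x)$, $x \in \mathbb{R}^n$, and $y \in \mathbb{R}^m$ be an at least once differentiable function.
Further let $x_{(1)} \in \mathbb{R}^n$ and $y_{(1)} \in \mathbb{R}^m$ be adjoint variables corresponding to the primal variables x, y.
Then the first order adjoint model $f_{(1)} : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n \times \mathbb{R}^m$ of f calculates the primal as well
as the product of the transposed Jacobian with $y_{(1)}$:

$$\begin{pmatrix} x_{(1)} \\ y \end{pmatrix} = f_{(1)}(x, y_{(1)}) = \begin{pmatrix} \nabla f(x)^T \cdot y_{(1)} \\ f(x) \end{pmatrix} .$$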
(a) DAG [graph not reproduced]

(b) SAC:
$v_2 = v_0 * v_1$
$v_3 = v_2 * v_1$
$v_4 = v_3 + v_2$

(c) Augmented SAC:
$\dot{v}_2 = v_1 * \dot{v}_0 + v_0 * \dot{v}_1$
$v_2 = v_0 * v_1$
$\dot{v}_3 = v_2 * \dot{v}_1 + v_1 * \dot{v}_2$
$v_3 = v_2 * v_1$
$\dot{v}_4 = 1 * \dot{v}_2 + 1 * \dot{v}_3$
$v_4 = v_3 + v_2$

Figure 2.11: Illustration of the tangent model for $y = (x_0 \cdot x_1)(1 + x_1)$, with DAG annotated
with partial derivatives on the edges (left), primal SAC (middle), and SAC augmented to
calculate the first order tangent model (right).





The full Jacobian can therefore be calculated at cost of $O(m) \cdot \mathrm{cost}(f)$ relative to the primal
function evaluation. For the common case of a scalar output y, the gradient can be obtained at
cost $O(1) \cdot \mathrm{cost}(f)$. Compared to the factor $O(n) \cdot \mathrm{cost}(f)$ of tangent mode this lower complexity
is the prime motivating feature of the adjoint mode.





The subscript $\cdot_{(1)}$ denotes the first adjoint direction. Similar to the tangent mode we define
the following equivalent notation for first order adjoint models:

$$\bar{x} \equiv x_{(1)} .$$

Again, the adjoint model is now applied to each assignment of the SAC. In adjoint mode, a
forward evaluation of the original program is succeeded by the propagation of adjoints for all $v_i$
in reverse order, that is, for $i = n+p-1, \ldots, 0$:

$$\begin{aligned}
&\text{for } j = n, \ldots, n+p+m-1: & v_j &= \varphi_j(v_i)_{i \prec j} & &\text{(forward section)} \\
&\text{for } i = n+p-1, \ldots, 0: & \bar{v}_i &= \sum_{j : i \prec j} \frac{\partial \varphi_j}{\partial v_i} \cdot \bar{v}_j & &\text{(reverse section)}
\end{aligned} \qquad (2.15)$$


Here the variables $\bar{v}$ are adjoint variables associated with the primal values v. The adjoint
sensitivities are evaluated only after the primal elemental functions have been evaluated, propagating
them from the outputs back to the inputs. In practice the sum resulting in $\bar{v}_i$ is not evaluated all
at once, but $\bar{v}_i$ is incremented by the incoming partial derivatives one at a time, motivating the
incremental nature of adjoint code [GW08].
(a) DAG [graph not reproduced]

(b) SAC:
$v_2 = v_0 * v_1$
$v_3 = v_2 * v_1$
$v_4 = v_3 + v_2$

(c) Augmented SAC:
$v_2 = v_0 * v_1$
$v_3 = v_2 * v_1$
$v_4 = v_3 + v_2$
$\bar{v}_3 = \bar{v}_4$; $\bar{v}_2 = \bar{v}_4$
$\bar{v}_2 \mathrel{+}= v_1 * \bar{v}_3$; $\bar{v}_1 = v_2 * \bar{v}_3$
$\bar{v}_0 = v_1 * \bar{v}_2$; $\bar{v}_1 \mathrel{+}= v_0 * \bar{v}_2$

Figure 2.12: Illustration of the adjoint model for $y = (x_0 \cdot x_1)(1 + x_1)$. Left: DAG with adjoint
extension s, annotated with partial derivatives on the edges. Center: SAC. Right: SAC
augmented to compute the first order adjoint model.


Note that the v; computed in the forward section are potentially required as arguments of local 
partial derivatives within the reverse section. They are read in reverse with respect to the original 
order of their evaluation. The additional persistent memory requirement of the adjoint code is 
O(n+p+m). This data flow reversal is the main challenge in adjoint AD. It is responsible for a 
naive implementation of AD typically not being applicable to large-scale numerical simulations. 
The available persistent memory may simply not be large enough [Nau12]. 





In Figure 2.12 we show the application of adjoint mode to the function already considered for
tangent mode. An auxiliary variable s is inserted into the DAG. Its partial derivative is defined
such that the adjoint model

$$[\bar{v}_0, \bar{v}_1] = \frac{ds}{d[v_0, v_1]} = \nabla v_4(v_0, v_1)^T \cdot \bar{v}_4$$

is created by multiplying the partials along the paths connecting $[v_0, v_1]$ to s. Note how the
forward evaluation of the program is now spatially separated from the calculation of the adjoints.
The adjoints are evaluated in reverse order, propagating the adjoint of the output $\bar{y} = \bar{v}_4$
back to the inputs $\bar{x}_0 = \bar{v}_0$ and $\bar{x}_1 = \bar{v}_1$. The intermediate $v_2$ is required to calculate $\bar{v}_1$ after
the primal evaluation has already ended, highlighting the additional memory cost added by the
adjoint method. As variables $v_1$ and $v_2$ have more than one outgoing edge, their adjoint value is
influenced by multiple paths of the data flow reversal, motivating the incremental nature of the
adjoint propagation. The auxiliary variable $v_3$ is not required in the reverse section as it is only
used linearly by other assignments and thus vanishes when evaluating Equation 2.15.
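
The same example can be written out as hand-coded adjoint code with a forward and a reverse section. The sketch below follows the statements of Figure 2.12; the names `adjointExample` and `AdjointResult` and the `Bar` suffix for adjoints are assumptions made for illustration.

```cpp
// Primal result y and adjoints of both inputs.
struct AdjointResult {
    double y;
    double x0Bar;
    double x1Bar;
};

// Adjoint code of y = (x0 * x1) * (1 + x1), following Equation 2.15.
AdjointResult adjointExample(double x0, double x1, double yBar) {
    // Forward section: primal SAC. v0, v1 and v2 stay alive because the
    // reverse section needs them as partial derivatives.
    const double v0 = x0;
    const double v1 = x1;
    const double v2 = v0 * v1;
    const double v3 = v2 * v1;
    const double v4 = v3 + v2;

    // Reverse section: adjoints are propagated from the output to the inputs.
    const double v4Bar = yBar;
    const double v3Bar = v4Bar;       // v4 = v3 + v2
    double v2Bar = v4Bar;             // v4 = v3 + v2
    v2Bar += v1 * v3Bar;              // v3 = v2 * v1, second path into v2
    double v1Bar = v2 * v3Bar;        // v3 = v2 * v1
    const double v0Bar = v1 * v2Bar;  // v2 = v0 * v1
    v1Bar += v0 * v2Bar;              // v2 = v0 * v1, second path into v1

    return {v4, v0Bar, v1Bar};
}
```

A single call with `yBar = 1` returns both partial derivatives, in contrast to the two evaluations required in tangent mode. The value of $v_3$ is computed but never read in the reverse section.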
Higher Order Derivative Models


Derivatives of second and higher order can be obtained by recursively applying the first order 
models onto models obtained by either tangent or adjoint mode. In the following, only the second 
order models are presented, as they can be used for a variety of optimization tasks, and can 
further be used to verify adjoints versus tangents, as presented in Section 4.4. For third and 
higher order models, please refer to literature, e.g. [Nau12]. 





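Definition 9 (Tangent over Tangent Model).
Let $f : \mathbb{R}^n \to \mathbb{R}$ with $y = f(x)$, $x \in \mathbb{R}^n$, and $y \in \mathbb{R}$ be an at least twice differentiable function.
Applying the first order tangent model to $f^{(1)}(x^{(1)}, x)$ yields the second order tangent model
$f^{(1,2)}(x^{(1,2)}, x^{(1)}, x^{(2)}, x) : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R} \times \mathbb{R} \times \mathbb{R} \times \mathbb{R}$, which calculates the following
relations:

$$\begin{pmatrix} y^{(1,2)} \\ y^{(1)} \\ y^{(2)} \\ y \end{pmatrix} =
\begin{pmatrix}
{x^{(1)}}^T \cdot \nabla^2 f(x) \cdot x^{(2)} + \nabla f(x) \cdot x^{(1,2)} \\
\nabla f(x) \cdot x^{(1)} \\
\nabla f(x) \cdot x^{(2)} \\
f(x)
\end{pmatrix}$$

One entry $h_{ij}$ of the Hessian can thus be obtained from $y^{(1,2)}$ by seeding $x^{(1)} = e_i$, $x^{(2)} = e_j$,
and $x^{(1,2)} = 0$. The full Hessian can be obtained at cost $n(n+1)/2 \cdot \mathrm{cost}(f^{(1,2)})$ by exploiting
the symmetry of the Hessian. For sparse matrices, the cost can be further lowered by coloring
approaches (see Section 2.11).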
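
The recursive construction can be mimicked with a small templated tangent type that is instantiated with itself. The sketch below is an illustration under assumptions: `Dual`, `seed` and `hessianEntry` are hypothetical names, only addition and multiplication are overloaded, and the function differentiated is the example $y = (x_0 \cdot x_1)(1 + x_1)$ used earlier, not a function from this definition.

```cpp
// Generic first order tangent type: value v and tangent t of type T.
template <typename T>
struct Dual {
    T v;
    T t;
};

template <typename T>
Dual<T> operator+(const Dual<T>& a, const Dual<T>& b) {
    return {a.v + b.v, a.t + b.t};
}

template <typename T>
Dual<T> operator*(const Dual<T>& a, const Dual<T>& b) {
    return {a.v * b.v, a.v * b.t + a.t * b.v};  // product rule
}

// Example function, written for any type providing + and *.
template <typename T>
T f(const T& x0, const T& x1, const T& one) {
    return (x0 * x1) * (one + x1);
}

using D1 = Dual<double>;  // first order tangent type
using D2 = Dual<D1>;      // tangent over tangent type

// Lift x with seeds x^(1) = s1 and x^(2) = s2; x^(1,2) is set to zero.
D2 seed(double x, double s1, double s2) {
    return D2{D1{x, s1}, D1{s2, 0.0}};
}

// Hessian entry h_ij at (x0, x1): seed x^(1) = e_i and x^(2) = e_j and
// read y^(1,2), which is stored in the tangent of the tangent.
double hessianEntry(double x0, double x1, int i, int j) {
    const D2 a = seed(x0, i == 0 ? 1.0 : 0.0, j == 0 ? 1.0 : 0.0);
    const D2 b = seed(x1, i == 1 ? 1.0 : 0.0, j == 1 ? 1.0 : 0.0);
    const D2 y = f(a, b, seed(1.0, 0.0, 0.0));
    return y.t.t;
}
```

In this layout `y.v.v` holds the primal, `y.v.t` and `y.t.v` hold the two first order tangents, and `y.t.t` holds the second order result, mirroring the four relations of the definition.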


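Definition 10 (Tangent over Adjoint Model).
Applying the first order tangent model to $f_{(1)}(y_{(1)}, x)$ yields the second order tangent over adjoint
model $f_{(1)}^{(2)}(x^{(2)}, x, y_{(1)}^{(2)}, y_{(1)}) : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R} \times \mathbb{R} \to \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R} \times \mathbb{R}$, which calculates the
following relations:

$$\begin{pmatrix} x_{(1)}^{(2)} \\ x_{(1)} \\ y^{(2)} \\ y \end{pmatrix} =
\begin{pmatrix}
y_{(1)} \cdot \nabla^2 f(x) \cdot x^{(2)} + \nabla f(x)^T \cdot y_{(1)}^{(2)} \\
\nabla f(x)^T \cdot y_{(1)} \\
\nabla f(x) \cdot x^{(2)} \\
f(x)
\end{pmatrix}$$

The i-th row/column of the Hessian can thus be obtained from $x_{(1)}^{(2)}$ by seeding $x^{(2)} = e_i$, $y_{(1)} = 1$
and $y_{(1)}^{(2)} = 0$. The full Hessian can be obtained at cost $n \cdot \mathrm{cost}(f_{(1)}^{(2)})$. Again the number of
evaluations of $f_{(1)}^{(2)}$ can be improved by coloring, seeding multiple directions of $x^{(2)}$ at once.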


Definition 11 (Adjoint over Adjoint Model).
Applying the first order adjoint model to $f_{(1)}(y_{(1)}, x)$ yields the second order adjoint over adjoint
model $f_{(1,2)}(x_{(1,2)}, x, y_{(1)}, y_{(2)}) : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R} \times \mathbb{R} \to \mathbb{R} \times \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}$, which calculates the
following relations:

$$\begin{pmatrix} y_{(1,2)} \\ x_{(1)} \\ x_{(2)} \\ y \end{pmatrix} =
\begin{pmatrix}
\nabla f(x) \cdot x_{(1,2)} \\
\nabla f(x)^T \cdot y_{(1)} \\
y_{(1)} \cdot \nabla^2 f(x) \cdot x_{(1,2)} + \nabla f(x)^T \cdot y_{(2)} \\
f(x)
\end{pmatrix}$$

The i-th row/column of the Hessian can be obtained from $x_{(2)}$ by seeding $x_{(1,2)} = e_i$, $y_{(1)} = 1$,
and $y_{(2)} = 0$. The full Hessian can be obtained at cost $n \cdot \mathrm{cost}(f_{(1,2)})$. The cost is thus identical
to the tangent over adjoint model, and no further gain in complexity is realized by applying
the adjoint model twice. In practice the choice of which adjoint model to use depends on the
implementation of the AD tool and caching considerations on the executing machine [Lot16].


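Definition 12 (Adjoint over Tangent Model).
Applying the first order adjoint model to $f^{(1)}(x^{(1)}, x)$ yields the second order adjoint over tangent
model $f_{(2)}^{(1)}(x^{(1)}, x, y_{(2)}, y_{(2)}^{(1)}) : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R} \times \mathbb{R} \to \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R} \times \mathbb{R}$, which calculates the
following relations:

$$\begin{pmatrix} x_{(2)} \\ x_{(2)}^{(1)} \\ y^{(1)} \\ y \end{pmatrix} =
\begin{pmatrix}
y_{(2)}^{(1)} \cdot \nabla^2 f(x) \cdot x^{(1)} + \nabla f(x)^T \cdot y_{(2)} \\
\nabla f(x)^T \cdot y_{(2)}^{(1)} \\
\nabla f(x) \cdot x^{(1)} \\
f(x)
\end{pmatrix}$$

The i-th row/column of the Hessian can thus be obtained from $x_{(2)}$ by seeding $x^{(1)} = e_i$, $y_{(2)}^{(1)} = 1$
and $y_{(2)} = 0$. The full Hessian can be obtained at cost $n \cdot \mathrm{cost}(f_{(2)}^{(1)})$.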


The adjoint over tangent model is rarely used in practice, due to the increased storage costs 
required for the backward propagation compared to the tangent over adjoint model. 


2.7.3 Tool Driven Derivative Generation 


Broadly speaking AD tools follow one of two approaches, source code transformation or operator 
overloading. Both approaches are briefly presented below. 


Source Code Transformation 


Source code transformation tools parse the primal source code and transform it by applying
the tangent or adjoint models, Equations (2.14) or (2.15), onto the individual statements,
producing a new source code which can then be compiled/executed and optimized by a regular
compiler/interpreter. The generated source code looks and performs much like a code differentiated
statement by statement by hand, but allows greater flexibility and more rapid code development.

The source code transformation approach requires that the tool has access to, and is able to
parse, the source code. This makes the code transformation of programs written in complex,
object oriented languages like C++ challenging. Especially templated codes, where substantial
parts of the code are only instantiated at compile time, are not well suited to differentiation
with source code transformation tools. Hybrid approaches are possible. For example, source
code transformation can be applied to numerical kernels that are limited to C syntax and some
subset of the C++ standard. The remaining program wrapping the numerical kernels can then
be differentiated by operator overloading. The resulting partial derivatives of the different code
blocks can then be multiplied together using the chain rule.

General purpose AD tools which use the source code transformation approach include TAPENADE
[HP13] (C and Fortran), dcc [För14] (C) and Tangent [Goo17] (Python). Furthermore
domain specific tools exist, e.g. DolfinAdjoint [Far+13] for the Dolfin [LW10] / FEniCS [Aln+15]
packages for solving differential equations using FEM.
#include <iostream>

struct ADtype{
  double v; // value component
  double t; // tangent component
  ADtype(const double& v, const double& t = 0.0) : v(v),t(t) {};
};

ADtype operator*(const ADtype& x1,const ADtype& x2){
  return ADtype(x1.v*x2.v, x1.v*x2.t + x1.t*x2.v);
};

ADtype operator+(const ADtype& x1,const ADtype& x2){
  return ADtype(x1.v+x2.v, x1.t+x2.t);
};

int main(){
  ADtype x = 2;
  x.t=1; // seeding
  ADtype y = x*x+x;
  std::cout << y.v << " " << y.t << std::endl; // 6 5
}

Listing 2.5: Implementation of a basic operator overloading tool. A custom type calculates
tangents of programs containing additions and multiplications. Application is demonstrated
by calculating the tangent $\dot{y} = 2x + 1 = 5$ of $y = x \cdot x + x$ at location $x = 2$.


Operator Overloading 


To circumvent the issues faced by the source code transformation approach, the operator
overloading approach uses the operator overloading features present in many modern programming
languages. We will focus on the implementation as it applies to C++. Operator overloading
allows replacing the intrinsic implementations of the basic numerical operators (+, -, ...) and
functions (sin, exp, pow) with custom behavior. An AD tool implements one or
multiple custom data types that are used to replace the floating point data types of the primal
calculation. In addition to the primal calculation, the custom data type implements the operators
necessary to propagate tangents/adjoints through all (unary or binary) elemental functions
occurring in the code.


As an illustration of this concept a very basic C++ operator overloading tool, which is able to
calculate tangents of programs containing assignments, additions, and multiplications, is shown
in Listing 2.5. It is applied to the expression $y = x \cdot x + x$.

Operator overloading tools are able to, with minor exceptions, cover the whole language
standard of C++ and are thus applicable to heavily templated code bases. Codes treated by those
tools require only minimal code changes (some of which are mentioned in Section 3.2.3) and are
by definition always up to date with the primal code base, as the instructions calculating the
derivatives are generated at compile and run time alongside the primal.





Operator overloading introduces a run time penalty, which is more pronounced in some languages
than in others. For compiled languages, most of the overhead can be offset by compile time
optimization, most significantly function inlining. The memory overhead of the adjoint is generally
more pronounced than for codes generated with source code transformation, as a representation
of the SAC has to be stored at run time. The performance and memory consumption can be
improved by (statement level) preaccumulation, briefly discussed in the next Section.

Operator overloading tools include ADOL-C [WG12] (C, C++), CoDi-Pack [SAG17] (C++),
dco/c++ [LLN16] and AdiMat [BBV06] (Matlab).

Novel approaches are currently in development, further reducing the run time overhead
by exploiting advanced templating features of C++, at the cost of requiring more code
alterations [Lep+17]. This allows preaccumulation to be extended from the statement level to whole
code blocks.
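
To illustrate why adjoint operator overloading has to store a representation of the SAC, the following sketch records every elemental operation on a tape and interprets the tape in reverse. It is a deliberately naive illustration: `Tape`, `Active` and `gradientExample` are hypothetical names, only + and * are supported, and the data structures actually used by tools such as dco/c++ are not shown here.

```cpp
#include <cassert>
#include <vector>

// One tape entry per SAC assignment: up to two arguments and the local
// partial derivatives with respect to them (argument index -1 = unused).
struct Tape {
    struct Node {
        int arg[2];
        double partial[2];
    };
    std::vector<Node> nodes;

    int record(int a0, double p0, int a1, double p1) {
        Node n{{a0, a1}, {p0, p1}};
        nodes.push_back(n);
        return static_cast<int>(nodes.size()) - 1;
    }
    int input() { return record(-1, 0.0, -1, 0.0); }

    // Reverse section: incremental adjoint propagation from the output.
    std::vector<double> reverse(int output) const {
        assert(output >= 0 && output < static_cast<int>(nodes.size()));
        std::vector<double> bar(nodes.size(), 0.0);
        bar[output] = 1.0;
        for (int j = output; j >= 0; --j)
            for (int k = 0; k < 2; ++k)
                if (nodes[j].arg[k] >= 0)
                    bar[nodes[j].arg[k]] += nodes[j].partial[k] * bar[j];
        return bar;
    }
};

// Active variable: primal value plus its position on the tape.
struct Active {
    Tape* tape;
    int index;
    double v;
};

Active operator*(const Active& a, const Active& b) {
    return {a.tape, a.tape->record(a.index, b.v, b.index, a.v), a.v * b.v};
}
Active operator+(const Active& a, const Active& b) {
    return {a.tape, a.tape->record(a.index, 1.0, b.index, 1.0), a.v + b.v};
}
Active operator+(double a, const Active& b) {
    return {b.tape, b.tape->record(b.index, 1.0, -1, 0.0), a + b.v};
}

// Gradient of y = (x0 * x1) * (1 + x1) from one forward and one reverse run.
std::vector<double> gradientExample(double x0v, double x1v) {
    Tape tape;
    Active x0{&tape, tape.input(), x0v};
    Active x1{&tape, tape.input(), x1v};
    Active y = (x0 * x1) * (1.0 + x1);
    const std::vector<double> bar = tape.reverse(y.index);
    return {bar[x0.index], bar[x1.index]};
}
```

The tape grows by one node per elemental operation, which is the memory overhead referred to above; preaccumulation would reduce the number of stored nodes.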


2.8 Introduction to dco/c++ 


In this section we will give a brief overview of the architecture and interface of the operator
overloading AD tool dco/c++ [LLN16]. Brief example drivers, usable to obtain gradients, Jacobians,
and Hessians, are given.


2.8.1 Scalar and vector tangent mode
The AD tool dco/c++ implements a generalized tangent scalar data type

template<class T> dco::gt1s<T>::type;

which carries a tangent component of type T alongside the value component of the same type. For
a first order tangent model, this type is instantiated with a floating point data type (e.g. float or
double). It can be initialized with a zero tangent by assigning a primal value.

dco::gt1s<double>::type x = 42;

The tangent type carries no further data; thus sizeof(dco::gt1s<T>::type) == 2*sizeof(T).
The derivative components of all dco/c++ types can be accessed by the interface routine
dco::derivative(). Here it returns a reference to the tangent component of the passed variable.

dco::gt1s<double>::type x;
double t = dco::derivative(x); // const access
dco::derivative(x) = 1.0; // non const access

The value component of a dco/c++ type can be accessed by the dco::value() routine. This
allows altering the value without changing the tangent/adjoint, and converting a dco/c++ type to
a passive type, discarding its derivative information.

dco::gt1s<double>::type x;
double v = dco::value(x); // const access to value component
dco::value(x) = 2.0; // non const access, tangent unmodified
x = 42.0; // overwrites tangent with 0.0!

Complementing the tangent scalar type gt1s<T>::type, dco/c++ also defines a vector type
gt1v<T,d>::type. This vector type carries a vector of tangents $t \in \mathbb{R}^d$ alongside the primal
value. The tangent model is applied to each vector entry individually, allowing d
seed directions to be evaluated at once. The vector mode reduces the number of elemental function evaluations.
gt1v<double,d> sin(const gt1v<double,d>& x){
  gt1v<double,d> y;
  dco::value(y) = sin(dco::value(x));
  double partial = cos(dco::value(x)); // evaluated once for all d tangents
  for(int i=0; i<d; i++)
    dco::derivative(y)[i] = partial*dco::derivative(x)[i];
  return y;
}

Listing 2.6: Implementation of sin operation for tangent vectors of length d.


To evaluate n seed directions, instead of n calls to the scalar tangent model, only $\lceil n/d \rceil$ calls to
the vector tangent model are needed. Further, it improves floating point performance, due to the
increased cache locality introduced by the loop evaluating the individual tangents. The benefit
due to caching varies from code to code. For maximum performance, the vector size is fixed at
compile time and implemented as a plain array. A sensible vector size is e.g. 16. Tangent vectors
which are too long consume excessive amounts of memory, and may decrease cache efficiency
due to the vector not fully fitting into one cache line. The size of a vector type in memory is
sizeof(dco::gt1v<T,d>::type) == (d+1)*sizeof(T).

Accessing the values of the tangents follows the same interface as the scalar tangent type, with
the difference that dco::derivative returns a reference to the tangent vector, from which the
desired element can be accessed using the usual vector operations.

dco::gt1v<double,d>::type x = 42; // init tangents to zero
double t = dco::derivative(x)[0]; // const access
dco::derivative(x)[0] = 1.0; // non const access

An exemplary implementation of the sin operation, using the interface introduced above, is shown
in Listing 2.6.

The general purpose driver shown in Listing 2.7 calculates the gradient of a multivariate
function $f : \mathbb{R}^n \to \mathbb{R}$, assumed to be implemented (externally) as T f(std::vector<T> x),
requiring n calls to the function f.

To find a sensible vector length d, the driver is tested for a problem size of $n = 2^{10}$ and tangent
vector sizes ranging from 2 to 1024. The functions considered are

$$y = \sum_{i=0}^{n-1} x_i \cdot x_i \quad \text{and} \quad y = \sum_{i=0}^{n-1} \sin x_i .$$

The results are shown in Figure 2.13. The results shown for d = 1 are obtained by the scalar
tangent types and the curves are normalized such that the run time of scalar execution equals one.
The products $x_i \cdot x_i$ are computed cheaply. Thus, the run time is dominated by memory lookup
of $x_i$, and caching effects are clearly observable. Here the optimal tangent vector size is 32, giving
a speed up of approximately factor 2.5 compared to the scalar tangent version. Trigonometric
functions are much more expensive to compute (here by a factor of roughly 25), therefore the
saved calls of the sin function for the primal and cos function for the partial derivative have a
bigger impact. Again, a vector size of 32 is a good choice; however, the minimum time is achieved
by choosing a vector size of 256, resulting in a speed up factor of 30.
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#include "dco_cpp_dev/src/dco.hpp" 
#include <vector> 
using namespace std; 


typedef dco::gtis<double>::type tis_type; 
const int vectorlssize — 16; 
typedef dco::gtiv<double ,vector_size>::type tiv_type; 


template<typename T> 
T £(€vector<I> x); 


// scalar driver 
vector<double> calc_grad_f_tis(const vector<double>& xd){ 
const int n = xd.size(); 
vector<double> grad(n) ; 
vector<tis_type> x(xd.begin() ,xd.end()); // copy passive values 
nf Oe (CFG 3,0) 3 a an 6 ak aear) al 
dco::derivative(x[iJ) = 1.0; // seed i-th unit vector 
tis_type y = f(x); // augmented primal calculation 
gradLli] = dco::derivative(y); 
dco::derivative(x[i]) = 0.0; // reset seed vector to 0 
Ir 


return grad; 


// wector driver with vector size 16 
vector<double> calc_grad_f_tiv(const vector<double>& xd){ 
const int n = xd.size(); 
COnst 1nt d= VeclLor size. 
vector<double> grad(n) ; 
vector<tiv_type> x(xd.begin() ,xd.end()); // copy passive values 
for(int i = 0; i < ceil(n/d); it+)f 
// increment through local and global indices 
for(int j=itd; j < min(€(Cita® «d@a); j++) 
dco::derivative(x[j]) [jZ%d] = 1.0; // seed g-th unit vector 
tiv_type y = f(x); //_augment@d primal calculation 
for(int j=itd; j < m@m@iti)*d,n); j++)1 


gradLlj] = dco::derivative(y)Ljjd]; // extract tangents 
dco::derivative(x[j]) [jZ%Zd] = 0.0; // reset seed vectors to 0 
i 
Ir 
return grad; 


t 


Listing 2.7: Driver calculating the full gradient of a multivariate function
$f : \mathbb{R}^n \to \mathbb{R}$ using tangent mode of AD.

Figure 2.13: Run time of the tangent vector benchmark for varying tangent vector size d. Run
time is normalized to the execution time of the scalar tangent.
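
The chunked seeding of Listing 2.7 can also be tried without dco/c++. The following is a minimal sketch, assuming a hypothetical stand-in type `tangent_vec` (one primal value plus d tangents) and the hypothetical helper names `sum_of_squares` and `gradient_chunked`; it is not the dco/c++ implementation, it only mirrors the idea that one evaluation of f propagates up to d directional derivatives.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a vector tangent type: one primal value and d tangents.
template <typename T, std::size_t d>
struct tangent_vec {
  T v;                 // primal value
  std::array<T, d> t;  // tangent components
  tangent_vec(T value = T()) : v(value) { t.fill(T()); }
};

template <typename T, std::size_t d>
tangent_vec<T, d> operator+(const tangent_vec<T, d>& a, const tangent_vec<T, d>& b) {
  tangent_vec<T, d> r(a.v + b.v);
  for (std::size_t k = 0; k < d; ++k) r.t[k] = a.t[k] + b.t[k];
  return r;
}

template <typename T, std::size_t d>
tangent_vec<T, d> operator*(const tangent_vec<T, d>& a, const tangent_vec<T, d>& b) {
  tangent_vec<T, d> r(a.v * b.v);
  for (std::size_t k = 0; k < d; ++k) r.t[k] = a.t[k] * b.v + a.v * b.t[k];  // product rule
  return r;
}

// One sin and one cos evaluation are shared by all d tangent components.
template <typename T, std::size_t d>
tangent_vec<T, d> sin(const tangent_vec<T, d>& a) {
  tangent_vec<T, d> r(std::sin(a.v));
  const T c = std::cos(a.v);
  for (std::size_t k = 0; k < d; ++k) r.t[k] = c * a.t[k];
  return r;
}

// y = sum_i x_i * x_i, written generically so that T can be an AD type.
template <typename T>
T sum_of_squares(const std::vector<T>& x) {
  T y(0.0);
  for (const T& xi : x) y = y + xi * xi;
  return y;
}

// Gradient via chunks of d unit seed vectors per evaluation.
template <std::size_t d>
std::vector<double> gradient_chunked(const std::vector<double>& xd) {
  typedef tangent_vec<double, d> tv;
  const std::size_t n = xd.size();
  std::vector<tv> x(xd.begin(), xd.end());  // copy passive values
  std::vector<double> grad(n);
  for (std::size_t begin = 0; begin < n; begin += d) {
    const std::size_t end = std::min(begin + d, n);
    for (std::size_t j = begin; j < end; ++j) x[j].t[j - begin] = 1.0;  // seed
    const tv y = sum_of_squares(x);                                     // one evaluation
    for (std::size_t j = begin; j < end; ++j) {
      grad[j] = y.t[j - begin];  // extract tangents
      x[j].t[j - begin] = 0.0;   // reset seed
    }
  }
  return grad;
}
```

With d = 2 and five inputs, the loop evaluates the function three times instead of five; the last chunk is only partially filled.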


The tangent mode of AD does not need any AD specific interface routines, other than the
access functions dco::value() and dco::derivative().


2.8.2 Adjoint mode 
Concepts 


For the calculation of adjoints, dco/c++ uses efficient data structures to store the information
required for data flow reversal. First, we introduce the graph data structure, with which the
data flow reversal can be described. Second, we show how dco/c++ stores this information
in its internal representation. This will become important for the optimizations applied in
Sections 3.4, 3.7, and 3.8.

Every elemental assignment in the code can be expressed as a DAG, modeling the SAC generated
by the assignment. For an elemental assignment, the output is by definition a scalar, connected
to one or multiple inputs. The SAC, consisting of only basic unary or binary operations, can be
conveniently differentiated with the chain rule by overloading the operators. A traditional operator
overloading approach, e.g. applied in ADOL-C [WG12], operates on data structures similar to the
DAG with all intermediate nodes included in the graph representation. By implementing operator
overloading via a template expression engine, the elemental gradients of a single assignment
in the code can be assembled during the augmented forward run by multiplying together the
partial derivatives created by the interior edges of the DAG. This technique is commonly referred
to as statement level preaccumulation [GW08]. For the specific implementation in dco/c++,
refer to [LLN16].

On the graph level, preaccumulation transforms the DAG into a bipartite graph connecting the
inputs to the scalar output. This significantly reduces the storage requirements for the data flow
reversal, as no intermediate nodes and edges of the SAC need to be stored.

An edge $e \in E$ in the bipartite graph structure $(V, E)$, pictured in Figure 2.14, corresponds to
a preaccumulated derivative of an output with respect to a specific input. The values of the edge
labels can be calculated during the augmented forward section, as they do not depend on any
values created in the adjoint propagation phase. A vertex $v \in V$ represents an adjoint variable.
During the reverse propagation phase, also called interpretation phase, the adjoint information
is propagated through the bipartite graphs modeling individual assignments $y = f(x)$. The
edge labels storing the preaccumulated partials $\partial y / \partial x_i$ are multiplied with the corresponding
adjoint $\bar{y}$ of the left hand side of the primal assignment. The resulting products are used to
increment the adjoints $\bar{x}_i$ of the inputs $x_i$, which are accessible via the edges of the graph:

$$\bar{x}_i \mathrel{+}= \frac{\partial y}{\partial x_i} \cdot \bar{y} \qquad \forall i \in \{0, \dots, n\}.$$

Figure 2.14: Illustration of the adjoint model for code consisting of two elemental assignments.
For each assignment in the original code, the individual edges created by the corresponding
SAC can be transformed into a bipartite graph using preaccumulation.








The values of the adjoint variables are stored in a contiguous vector in RAM, called the adjoint
vector. The preaccumulated partial derivatives are stored in a graph-like structure, storing
the partials as well as the position of the element in the adjoint vector which needs to be
incremented during the reverse propagation phase. We call this data structure the adjoint stack,
as it continuously grows during the augmented forward run and gets evaluated sequentially in
opposite order during the reverse propagation phase (last in, first out). In the actual implementation,
the stack is implemented as a vector instead, to allow reinterpretation. We call the combination
of the adjoint stack and adjoint vector the tape.














#include <vector>

typedef std::vector<double> adjoint_vector;
struct partial_edge{
  double partial_val;
  int target_idx; // entry of the adjoint vector which will be incremented
};
typedef std::vector< std::vector<partial_edge> > partial_edges;
void interpret_tape(const partial_edges& s, adjoint_vector& v){
  for(int i = s.size()-1; i >= 0; i--)
    for(const partial_edge& p : s[i])
      v[p.target_idx] += p.partial_val * v[i];
}


Listing 2.8: Conceptual implementation of the tape interpretation.

Figure 2.15: Storage of preaccumulated partial derivatives and the target index in the adjoint
vector (left), and adjoint vector (right). The preaccumulated partials are used to increment
the adjoints during the reverse sweep.





For each assignment in the code (or in the SAC without preaccumulation), an entry on the
stack will be generated, storing all partial edges corresponding to this assignment. A conceptual
implementation of the tape interpretation is given in Listing 2.8; the corresponding implementation
of the operators generating the tape entries is given in Appendix D.





For a graphical representation of the tape created by the SAC from Figure 2.14, as well as
the incrementation process, see Figure 2.15. Note that the positions of the variables which need
to be loaded from the adjoint vector in order to increment a specific adjoint are not necessarily
close together. For example, a simple iterative program dependent on a static parameter would
need to increment the adjoint of the parameter, stored at the first position of the adjoint vector,
for each iteration of the loop. Furthermore, one adjoint of the iteration variable x needs to be
incremented, which will be stored at some (decreasing) distance to the adjoint of the parameter.
This leads to random memory access patterns (which for this simple example could be handled
by multiple cache lines, but will fail for more complex iterations) and also makes it difficult to
offload chunks of the adjoint vector to secondary storage, as entries in multiple chunks may need
to be incremented for the adjoint of a single expression.

The tape only stores the preaccumulated partials for a specific state of the primals. To evaluate
the adjoints for different primal values, the tape has to be newly recorded, requiring a full
re-evaluation of the program. Tools which store the full SAC (e.g. ADOL-C or optionally
CoDiPack) only evaluate the Jacobians during the reverse interpretation sweep, which allows the
adjoint model to be evaluated for different primal states without re-recording. This approach comes
at the expense of significantly higher memory consumption and lower performance for a single
evaluation of the adjoint model. In contrast, different adjoint seeds can be evaluated with an
already recorded tape.











An example tape representation directly exported from the internal dco/c++ data structure is
shown in Figure 2.16. This tape is created by three iterations of the Babylonian iterative root
finding algorithm. It calculates $\sqrt{a}$ with $a = 2$ from an initial guess of $x = 2$ using the iteration
procedure x = 0.5*(a/x+x). The edge labels are the partial derivatives stored in the tape, the
vertex labels are a pair of the location of the adjoint in the adjoint vector and the value of the
adjoint after the reverse propagation has finished. The adjoint of the parameter a is located at
the first entry in the adjoint vector and evaluates to $0.353542 \approx \frac{1}{2\sqrt{a}}$. The Babylonian root finding
algorithm is discussed in further detail in Section 3.8.2.

[Figure 2.16, recovered vertex labels: (1, 0.353542), (2, 9.61169e-05), (3, 0.0017301)]

Figure 2.16: Internal representation of the tape, generated by Babylonian root finding algorithm.


dco/c++ Adjoint Interface 


The basic operations dco::derivative() and dco::value(), already introduced for the tangent
mode, still apply for the adjoint mode. Additional routines, corresponding to the management
and interpretation of the tape, are needed. dco/c++ provides a global tape, which is accessible
through the global_tape pointer at any point in the program. Specialized tapes with local
scope can also be created; however, this feature is not needed in the context of this thesis. The
global tape needs to be allocated at the beginning of the program and is alive until it is explicitly
destroyed or the program exits.


#include <dco.hpp>
int main(){
  // allocation
  dco::ga1s<double>::global_tape = dco::ga1s<double>::tape_t::create();

  some_calculation();

  // destruction
  dco::ga1s<double>::tape_t::remove( dco::ga1s<double>::global_tape );
}





If not otherwise specified, the tape is allocated as a chunk tape, meaning that the adjoint stack
does not have a fixed size and grows on demand. New space for the stack is allocated in chunks
that are not necessarily adjacent in memory. An alternative is the usage of a fixed size tape,
which yields slightly higher performance at the expense of less flexibility.

Variables corresponding to inputs, for which derivatives are desired, need to be registered in
the tape. At this point they will be assigned a positive tape index, uniquely identifying them to
avoid name aliasing. The tape index can be used to look up the corresponding adjoint from the
adjoint vector.

The registration process helps to reduce the run time and memory overhead of the adjoint
by performing an activity analysis [HNP05]. That is, by default variables are considered to be
passive and do not enter the DAG with any partials. A variable becomes active once it gets
assigned the result of an elemental function involving an active variable.





dco::ga1s<double>::type x = 21; // x still passive, dco::tape_index(x)=0
x = 2*x; // activity analysis: x=42, dco::tape_index(x)=0
dco::ga1s<double>::global_tape->register_variable(x); // x active, dco::tape_index(x)=1
x = 2*x; // aliasing of x, x=84, dco::tape_index(x)=2
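
The index bookkeeping in the snippet above can be mimicked in a few lines. This is a minimal sketch under stated assumptions: `mini_tape`, `active`, `register_variable` and `scale` are hypothetical names invented for illustration, and the real dco/c++ types do considerably more (they also record partials).

```cpp
// Hypothetical miniature of the registration logic; not the dco/c++ implementation.
struct mini_tape {
  int next_index = 1;  // index 0 is reserved for passive values
  int new_index() { return next_index++; }
};

struct active {
  double value;
  int tape_index;  // 0 means passive
};

// Registering an input assigns a fresh positive tape index.
inline void register_variable(mini_tape& tape, active& x) {
  x.tape_index = tape.new_index();
}

// c * x: the result only becomes active (gets a new index) if x is active.
inline active scale(mini_tape& tape, double c, const active& x) {
  active r = {c * x.value, 0};
  if (x.tape_index != 0) r.tape_index = tape.new_index();
  return r;
}
```

Overwriting x with `scale(tape, 2.0, x)` before registration leaves the index at 0; after registration, the same statement assigns a new index, so the old and new values of x remain distinguishable on the tape.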





Once the independent variables have been registered, the primal calculation can be executed. 
The overloaded operators will populate the adjoint stack with the preaccumulated gradients and 
update the tape indices of the variables involved in the computations. 

The aforementioned concepts are illustrated in the general purpose driver shown in Listing 2.9,
which calculates the gradient of a multivariate function $f : \mathbb{R}^n \to \mathbb{R}$, equivalent to the tangent
driver discussed earlier.


#include "dco_cpp_dev/src/dco.hpp"
#include <vector>
using namespace std;

typedef dco::ga1s<double> a1s_mode;
typedef dco::ga1s<double>::type a1s_type;

template<typename T> T f(vector<T> x);

vector<double> calc_grad_f_a1s(const vector<double>& xd){
  a1s_mode::global_tape = a1s_mode::tape_t::create();
  const int n = xd.size();
  vector<double> grad(n);
  vector<a1s_type> x(xd.begin(), xd.end()); // copy passive values
  a1s_mode::global_tape->register_variable(x.begin(), x.end());

  a1s_type y = f(x); // augmented primal calculation
  dco::derivative(y) = 1; // seeding
  a1s_mode::global_tape->interpret_adjoint(); // reverse propagation

  grad = dco::derivative(x); // extract derivatives of vector x
  a1s_mode::tape_t::remove(a1s_mode::global_tape);
  return grad;
}


Listing 2.9: Driver calculating the full gradient of a multivariate function
$f : \mathbb{R}^n \to \mathbb{R}$, using adjoint mode of AD.

The tape persists after the interpretation and can be re-evaluated with different adjoint seeds.
Re-evaluating a previously recorded tape allows different adjoints to be calculated for the same primal
evaluation path, without the need to execute the augmented primal calculation again. Due to the
incremental nature of the adjoint propagation, it is usually necessary to zero all elements of the
adjoint vector prior to seeding and propagating new adjoints.


// zero all entries of the adjoint vector
dco::ga1s<double>::global_tape->zero_adjoints();


The contents of the tape can be reset without destroying the tape, which is more efficient than
destroying and re-allocating the tape repeatedly.


// delete tape entries and corresponding adjoint vector
dco::ga1s<double>::global_tape->reset_adjoints();


For the tape routines interpret_adjoint, zero_adjoint and reset, corresponding routines exist
that operate only on a subset of the tape. The current position in the tape can be
queried with get_position.


dco::ga1s<double>::position_t to
  = dco::ga1s<double>::global_tape->get_position();
// do sth
dco::ga1s<double>::position_t from
  = dco::ga1s<double>::global_tape->get_position();

dco::ga1s<double>::global_tape->interpret_adjoint_to(to);
dco::ga1s<double>::global_tape->interpret_adjoint_from_to(from,to);
dco::ga1s<double>::global_tape->zero_adjoint_to(to);
dco::ga1s<double>::global_tape->zero_adjoint_from_to(from,to);
dco::ga1s<double>::global_tape->reset_adjoint_to(to);


Listing 2.10: Operations can be restricted to a part of the tape. 


2.8.3 Higher Order Derivative Models 


For higher order derivative models, the tangent and adjoint data types, as well as the access
routines, can be recursively nested. This will be demonstrated in Section 3.6; for now we will
focus on first order data types.


2.9 Gradient Based Optimization 


2.9.1 Steepest Descent 


One of the most intuitive and popular choices for gradient based optimization is the method
of steepest descent. Here we focus on our CFD optimization setting with parameters $\alpha \in \mathbb{R}^{n_c}$,
consisting of pre-processing, processing, and post-processing. The chaining of the processing
functions requires the use of the total derivative. From a current parameter state $\alpha^i$, the next
state is determined by moving the state a distance $\lambda^i$ in the (negative) direction of the gradient
of $\mathcal{J}$ w.r.t. $\alpha$:

$$\alpha^{i+1} = \alpha^i - \lambda^i \nabla_\alpha \mathcal{J}(\alpha^i) .$$


As the gradient of a function points in the direction of steepest ascent, a reduction in the cost 
function can be achieved by moving the state in the opposite direction. 


The optimal step size $\lambda^i$ can be chosen by performing a line search along the gradient direction
and finding the $\lambda$ which minimizes the cost function:

$$\lambda^i = \arg\min_{\lambda} \mathcal{J}\left(\alpha^i - \lambda \cdot \nabla_\alpha \mathcal{J}(\alpha^i)\right) .$$

This creates an additional optimization problem, albeit only for the scalar parameter $\lambda$, requiring
additional cost function evaluations as well as derivatives
$d\mathcal{J}(\alpha^i - \lambda \cdot \nabla_\alpha \mathcal{J}) / d\lambda$. As $\lambda$ is a
scalar, the derivatives can conveniently be calculated with tangent mode or FD.

To avoid the cost and complexity of calculating additional derivatives, a pragmatic approach is
to choose a somewhat arbitrary value for $\lambda$, ensuring that it improves the cost functional, that
is $\mathcal{J}(\alpha^{i+1}) < \mathcal{J}(\alpha^i)$. A popular implementation is the bisection algorithm. Starting from an
initial guess, $\lambda$ is iteratively halved until an improvement in the cost function is achieved. The
bisection approach is illustrated in Algorithm 1.


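Algorithm 1: Bisection line search algorithm.

Input: previous parameter state $\alpha^i$
Data: gradient $\nabla_\alpha \mathcal{J}$, initial line search step size $\lambda_{start}$
Output: new parameter state $\alpha^{i+1}$

$\lambda \leftarrow \lambda_{start}$;
while $\mathcal{J}(\alpha^i - \lambda \nabla_\alpha \mathcal{J}) > \mathcal{J}(\alpha^i)$ do
    $\lambda \leftarrow \lambda / 2$;
end
$\alpha^{i+1} \leftarrow \alpha^i - \lambda \nabla_\alpha \mathcal{J}$;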


For $\|\nabla_\alpha \mathcal{J}\| \neq 0$, the algorithm is guaranteed to terminate, as for $\lambda \to 0^+$ a reduction must
be achieved, else $-\nabla_\alpha \mathcal{J}$ would not be a descent direction, contradicting the definition of the
gradient. No additional derivatives are required for the bisection. Thus, the additional cost
function evaluations can be performed in passive mode, as the gradient $d\mathcal{J} / d\alpha^i$ is considered
fixed during the line search.
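
The bisection line search can be sketched in a few lines of C++. The function name `bisection_line_search`, its signature, and the `max_halvings` safeguard are assumptions made for this illustration. In contrast to the loop condition of Algorithm 1, the sketch only accepts a step that strictly decreases the cost, which matches the requirement stated in the text.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

typedef std::vector<double> vec;

// Halve lambda until J(a - lambda * grad) < J(a). Returns the accepted step
// size and writes the new state to a_new. If no improvement is found within
// max_halvings (e.g. for a zero gradient), the state is left unchanged and
// 0 is returned. The gradient is fixed, so J is only evaluated passively.
double bisection_line_search(const std::function<double(const vec&)>& J,
                             const vec& a, const vec& grad,
                             double lambda_start, vec& a_new,
                             int max_halvings = 60) {
  const double J_old = J(a);
  double lambda = lambda_start;
  a_new = a;
  for (int h = 0; h <= max_halvings; ++h) {
    for (std::size_t i = 0; i < a.size(); ++i)
      a_new[i] = a[i] - lambda * grad[i];  // trial state
    if (J(a_new) < J_old) return lambda;   // improvement found
    lambda *= 0.5;                         // bisect the step size
  }
  a_new = a;
  return 0.0;
}
```

For the quadratic J(x) = x_0^2 + 10 x_1^2 at (1, 1) with gradient (2, 20) and an initial step size of 1, four halvings are needed before the cost decreases, giving a step size of 0.0625.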





To ensure convergence, the step size can be chosen such that the Wolfe conditions [Wol69] are
fulfilled, adding further complexity to the line search algorithm.

As an example, showcasing the limitations of steepest descent, Figure 2.17 illustrates the
convergence of the steepest descent algorithm for the Rosenbrock function [Ros60]

$$f(x, y) = (1 - x)^2 + 100 (y - x^2)^2$$

with a fixed step size. Finding the optimum of the Rosenbrock function is notoriously difficult for
gradient based optimization methods, due to the long valley with only a low gradient leading to
the optimum. Consequently the steepest descent algorithm needs several thousand iterations to
converge from the chosen starting position to the optimum at (1,1).

Figure 2.17: Iteration history of the steepest descent algorithm (green trajectory) and Newton's
algorithm (red trajectory), from starting point P = (-1,3) to the optimum at O = (1,1) of
the Rosenbrock function.


2.9.2 Newton's Method 


With the additional information of curvature available from the Hessian $H = \nabla^2 f(x)$, an improved
iteration called Newton's method can be constructed:

$$x^{i+1} = x^i - \underbrace{\left(\nabla^2 f(x^i)\right)^{-1} \nabla f(x^i)}_{=z}$$

This method finds stationary points, in particular (local) minima, characterized by $\nabla f(x) = 0$.
Due to the cost and stability issues involved in the explicit calculation of the inverse Hessian
$\left(\nabla^2 f(x^i)\right)^{-1}$, in practice an equivalent two stage procedure is used, where the update step
is calculated by the solution of a linear system with a linear solver $S$:

$$\nabla^2 f(x^i) \, z = \nabla f(x^i) \;\rightarrow\; z = S\left(\nabla^2 f(x^i), \nabla f(x^i)\right)$$
$$x^{i+1} = x^i - z$$

For certain optimization problems, Newton's method can dramatically reduce the number of
iterations needed, outweighing the additional complexity required to obtain the second order
derivative information. For example, the Rosenbrock example in Figure 2.17 only needs four
iterations of Newton's method to converge to machine precision, compared to thousands of
iterations with steepest descent.

Newton's method can be shown to converge quadratically when started from a point
sufficiently close to the solution. However, convergence is not guaranteed from all
starting points [Deu11]. It is sometimes advisable to initialize the solution with a number of
steepest descent iterations to get into the region of convergence of Newton's method and then
switch to the faster converging Newton's method.

For problems of high dimension, where the assembly of the Hessian is considered too costly or
complex, Quasi-Newton methods can be used. Quasi-Newton algorithms iteratively construct
better approximations for the inverse Hessian, retaining some of the convergence properties of
Newton's method. A popular Quasi-Newton method is the BFGS algorithm, which has been
implemented in multiple variations [Bro70; Fle70; Gol70; Sha70].


2.10 Discrete Adjoint Residual Approach 


Many researchers have already implemented an approach which only requires the residuals
of the FVM discretization systems to be differentiated. Most just call it the discrete adjoint
approach [Mav07; RU13; NJ07; He+18; Gil+03]. To avoid confusion with the algorithmic
discrete adjoint, applied to the whole non-linear iteration step, we will call it the discrete adjoint
residual approach.

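For the usual (laminar) flow state in discretized form $x = (U, p)$, convergence of the governing
equations implies that the residual $R$ of the Navier-Stokes equations is (near) zero:

$$R(x(\alpha), \alpha) = 0 .$$

Thus, the total derivative of the residual can be expressed as

$$\frac{dR}{d\alpha} = \frac{\partial R}{\partial x} \frac{\partial x}{\partial \alpha} + \frac{\partial R}{\partial \alpha} = 0 . \qquad (2.16)$$

The sensitivity of the states w.r.t. the parameters can be calculated by transforming (2.16) into

$$\frac{\partial x}{\partial \alpha} = - \left( \frac{\partial R}{\partial x} \right)^{-1} \frac{\partial R}{\partial \alpha} . \qquad (2.17)$$

The desired sensitivities $d\mathcal{J}/d\alpha$ can be calculated by inserting (2.17) into the total derivative
for $\mathcal{J}$:

$$\frac{d\mathcal{J}(x(\alpha), \alpha)}{d\alpha} = \frac{\partial \mathcal{J}}{\partial \alpha} + \frac{\partial \mathcal{J}}{\partial x} \frac{\partial x}{\partial \alpha} = \frac{\partial \mathcal{J}}{\partial \alpha} - \frac{\partial \mathcal{J}}{\partial x} \left( \frac{\partial R}{\partial x} \right)^{-1} \frac{\partial R}{\partial \alpha} . \qquad (2.18)$$

Defining

$$\lambda_x^T = \frac{\partial \mathcal{J}}{\partial x} \left( \frac{\partial R}{\partial x} \right)^{-1} ,$$

one can eliminate the inverse operation from (2.18) by solving the linear equation system

$$\left( \frac{\partial R}{\partial x} \right)^T \lambda_x = \left( \frac{\partial \mathcal{J}}{\partial x} \right)^T$$

for $\lambda_x$ and substituting the result into (2.18):

$$\left( \frac{d\mathcal{J}(x(\alpha), \alpha)}{d\alpha} \right)^T = \left( \frac{\partial \mathcal{J}}{\partial \alpha} \right)^T - \left( \frac{\partial R}{\partial \alpha} \right)^T \lambda_x . \qquad (2.19)$$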


When using the adjoint mode of AD, the (matrix-vector) product $(\partial R / \partial \alpha)^T \cdot \lambda_x$ can be evaluated
without explicitly calculating the Jacobian $\partial R / \partial \alpha$, by exploiting that the adjoint model computes
the projection of the transposed Jacobian in an arbitrary direction. Therefore, $\lambda_x$
can be chosen as a seed direction. The adjoint model then calculates the desired matrix-vector
product. The explicit dependence $\partial \mathcal{J} / \partial \alpha$ of the cost function on the parameters is often zero,
or else can be calculated cheaply with one evaluation of the adjoint model. The costly operations
are therefore the calculation of the Jacobian of the residual w.r.t. the state, $\partial R / \partial x$, and the
solution of the linear equation system.

The linear system can potentially be solved using an iterative matrix free solver (e.g. BiCG).
Matrix free solvers, in contrast to regular solvers, only require a means to evaluate matrix vector
products of the matrix with arbitrary vectors. AD allows the efficient evaluation of the (transposed)
Jacobian vector product. Using a matrix free solver, the full Jacobian has to be neither evaluated
nor stored. However, due to the poor conditioning of the linear system, matrix free
solvers are usually not a feasible option in practice: with a matrix free solver, the problem cannot be
effectively preconditioned and will not converge well. Therefore, the full (sparse) Jacobian has to
be calculated, either in adjoint or tangent mode, or using FD.

Note, that some authors calculate the Jacobian with FD and still call their implementation 
discrete adjoint, in reference to the adjoint equations (e.g. [He+18]). 

To effectively calculate the sparse Jacobian $\partial R / \partial x$, coloring techniques can be used, lowering
the complexity from $O(n_x)$ to $O(d)$, where d is the maximum number of cells which influence
the discretization of a cell, that is

$$d = \max_i \left| \left\{ j : \frac{\partial R_i}{\partial x_j} \neq 0 \right\} \right| .$$


Coloring techniques are presented in the following section. 
The discrete adjoint residual approach has been implemented in the discrete adjoint OpenFOAM 
framework, along with the necessary coloring heuristics. Details are presented in Section 4.5. 














2.11 Matrix Coloring 


2.11.1 Jacobian Compression 


The calculation of sparse Jacobians and Hessians can be considerably sped up by exploiting the 
sparsity and orthogonality of rows and/or columns. 


Definition 13 (Structural Orthogonality).
Two vectors $v, w \in \mathbb{R}^n$ are structurally orthogonal if and only if at each index at least one of the
vectors is zero, that is

$$\sum_{i=0}^{n-1} (v_i w_i)^2 = 0 .$$

The tangent model computes the product of the Jacobian $J = \nabla f$ with a direction vector, the
adjoint model the product of the transposed Jacobian with a direction vector. If multiple columns of
the Jacobian are pairwise structurally orthogonal, all non-zero entries of those columns can be
computed with one evaluation of the tangent model, by superimposing the unit vectors which
would be used to calculate the corresponding columns.

Similarly, all non-zero entries of structurally orthogonal rows can be computed with one
evaluation of the adjoint model.

Definition 14 (Jacobian column compression).

Let $J$ be the Jacobian of a function $f : \mathbb{R}^n \to \mathbb{R}^m$, with associated row and column index sets $\mathcal{I}$
and $\mathcal{J}$. Let $\mathcal{J}_C \subseteq \mathcal{J}$ be a set of column indices for which the columns of the Jacobian are
pairwise structurally orthogonal, that is

$$J_{i,j_1} \cdot J_{i,j_2} = 0 \qquad \forall i \in \mathcal{I}, \; j_1, j_2 \in \mathcal{J}_C, \; j_1 \neq j_2 .$$

Then $v = J \cdot \sum_{j \in \mathcal{J}_C} e_j = \sum_{j \in \mathcal{J}_C} J e_j$ contains all non-zero elements of the columns corresponding
to the indices $\mathcal{J}_C$.


Definition 15 (Jacobian coloring). 

A Jacobian coloring groups rows/columns of the Jacobian into sets of structurally orthogonal 
rows/columns. We refer to those sets as colors of the Jacobian. Each row/column is assigned a 
color, and is thus included in exactly one set. 


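Definition 16 (Row Seed Matrix).

A row seed matrix is a matrix $S_R \in \{0,1\}^{c \times m}$ with $\sum_{i=0}^{c-1} S_{ij} = 1$ for all $j = 0, \dots, m-1$,
that is, every row of the Jacobian is assigned to exactly one of the $c$ colors. The
individual rows of the seed matrix $S_R$ can be used as seed vectors for the adjoint model of AD, to
compute a row of the compressed Jacobian $J_R = S_R \cdot J \in \mathbb{R}^{c \times n}$.


Definition 17 (Column Seed Matrix).

A column seed matrix is a matrix $S_C \in \{0,1\}^{n \times c}$ with $\sum_{j=0}^{c-1} S_{ij} = 1$ for all $i = 0, \dots, n-1$,
that is, every column of the Jacobian is assigned to exactly one of the $c$ colors. The
individual columns of the seed matrix $S_C$ can be used as seed vectors for the tangent model of AD,
to compute a column of the compressed Jacobian $J_C = J \cdot S_C \in \mathbb{R}^{m \times c}$.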

Definition 18 (Bipartite row/column graph of a matrix).

The bipartite graph of a matrix $A$ is defined as $G = ((R, C), E)$. Each row of the matrix $A$
corresponds to a vertex $r_i$ in the set $R$. Each column of the matrix $A$ corresponds to a vertex $c_j$ in
the set $C$. An element $r_i$ of the set $R$ is connected to an element $c_j$ in $C$ by an edge $(r_i, c_j) \in E$
if and only if the matrix entry $A_{ij}$ is non-zero.


The bipartite graph of a matrix can be partially or fully colored to obtain a feasible Jacobian 
coloring [GMP05]. 
We will briefly present the previous concepts on the following example matrix $A \in \mathbb{R}^{4 \times 4}$ with
8 non-zero entries.

$$A = \begin{pmatrix} a_{00} & a_{01} & 0 & 0 \\ 0 & a_{11} & 0 & a_{13} \\ 0 & 0 & a_{22} & a_{23} \\ a_{30} & 0 & a_{32} & 0 \end{pmatrix}$$

A possible column coloring for the example is $C = \{\{c_0, c_3\}, \{c_1, c_2\}\}$, a possible row coloring is
$R = \{\{r_0, r_2\}, \{r_1, r_3\}\}$. The resulting seed matrices are

$$S_C = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix} \quad \text{and} \quad S_R = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix} .$$

The resulting compressed Jacobians, from which all nonzero entries can be reconstructed, are:

$$A_C = A S_C = \begin{pmatrix} a_{00} & a_{01} \\ a_{13} & a_{11} \\ a_{23} & a_{22} \\ a_{30} & a_{32} \end{pmatrix} \quad \text{and} \quad A_R = S_R A = \begin{pmatrix} a_{00} & a_{01} & a_{22} & a_{23} \\ a_{30} & a_{11} & a_{32} & a_{13} \end{pmatrix} .$$

Figure 2.18: Bipartite graph representation of non-zero pattern of matrix A, partial distance-two
row coloring, partial distance-two column coloring.

The bipartite graph for the matrix, as well as the feasible partial row and column colorings given
above, are shown in Figure 2.18.
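
Column compression and recovery for the example matrix can be sketched as follows. The greedy heuristic in `color_columns` and the helper names `structurally_orthogonal`, `compress` and `recover` are assumptions made for this illustration; they are not the coloring heuristics of the thesis. In an AD setting each column of the compressed matrix would come from one tangent evaluation, whereas here the product with the seed matrix is formed directly.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double> > matrix;

// Columns j1 and j2 are structurally orthogonal if no row has a non-zero in both.
bool structurally_orthogonal(const matrix& A, std::size_t j1, std::size_t j2) {
  for (std::size_t i = 0; i < A.size(); ++i)
    if (A[i][j1] != 0.0 && A[i][j2] != 0.0) return false;
  return true;
}

// Greedy coloring: each column gets the smallest color whose already colored
// members are all structurally orthogonal to it.
std::vector<int> color_columns(const matrix& A) {
  const std::size_t n = A.empty() ? 0 : A[0].size();
  std::vector<int> color(n, -1);
  for (std::size_t j = 0; j < n; ++j) {
    int c = 0;
    bool conflict = true;
    while (conflict) {
      conflict = false;
      for (std::size_t k = 0; k < j; ++k)
        if (color[k] == c && !structurally_orthogonal(A, k, j)) conflict = true;
      if (conflict) ++c;
    }
    color[j] = c;
  }
  return color;
}

// Compressed matrix A_C = A * S_C: column c sums all columns of color c.
matrix compress(const matrix& A, const std::vector<int>& color, int num_colors) {
  matrix A_C(A.size(), std::vector<double>(num_colors, 0.0));
  for (std::size_t i = 0; i < A.size(); ++i)
    for (std::size_t j = 0; j < color.size(); ++j)
      A_C[i][color[j]] += A[i][j];
  return A_C;
}

// Recover the entries using the known sparsity pattern (non-zeros of pattern).
matrix recover(const matrix& A_C, const std::vector<int>& color, const matrix& pattern) {
  matrix A(pattern.size(), std::vector<double>(color.size(), 0.0));
  for (std::size_t i = 0; i < pattern.size(); ++i)
    for (std::size_t j = 0; j < color.size(); ++j)
      if (pattern[i][j] != 0.0) A[i][j] = A_C[i][color[j]];
  return A;
}
```

For the non-zero pattern of the example matrix, the greedy pass assigns the colors (0, 1, 1, 0), which corresponds to the column coloring given above, so two tangent evaluations suffice instead of four.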


Lemma 1.
A valid partial distance-two row coloring on the bipartite graph solves the row compression problem,
allowing the nonzero entries to be recovered directly from the compressed Jacobian [GMP05].

3 Differentiation of Complex Iterative CFD Algorithms


In this chapter the differentiation of complex iterative CFD algorithms is discussed. Particular 
emphasis is placed on the efficient implementation of the developed discrete adjoint methods in 
OpenFOAM. First we will focus on a black-box application of AD and then later implement 
various improvements. 





3.1 Foundations 


3.1.1 Optimization Problem 


In the following we define the general optimization problem motivating most of this thesis.
We consider a general set of parameters $\gamma \in \mathbb{R}^p$. For topology optimization, we call this
set $\gamma = \alpha \in \mathbb{R}^{n_c}$, for shape optimization $\gamma = \beta \in \mathbb{R}^{n_\beta}$.

Using these parameters, we define an optimization problem over a calculation consisting of 
three distinct phases: 


Pre-processing $\mathcal{P}$: Initialize the solver and create an initial state $x^0 \in \mathbb{R}^m$.
Processing $\mathcal{F}$: Solve the underlying ODEs (iteratively), to obtain final state(s).
Post-Processing $\mathcal{J}$: Create a scalar output from the final state(s) by evaluating a cost function.


The full chain $y = \mathcal{J} \circ \mathcal{F} \circ \mathcal{P}$ needs to be differentiated in order to obtain the gradient $\frac{dy}{d\gamma}$.
Optimization can then be applied to $\gamma$, in order to alter the solution and improve the cost
function output, as calculated by the post-processor. To keep parameters in a feasible range,
we assume box constraints for the individual parameters $\gamma_i$. In practice these can commonly be
replaced by a global lower and upper bound.


minimize $y(\gamma)$
subject to $l_i \le \gamma_i \le u_i, \; i = 0, \dots, p-1$.


Depending on the region of application, different combinations of unsteadiness are possible. We
introduce feasible definitions for the functions $\mathcal{P}$, $\mathcal{F}$, and $\mathcal{J}$ for varying degrees of unsteadiness
below. Ducted flows are often laminar and steady, while external aerodynamic flows often exhibit
varying levels of transient effects.


Steady Data, Steady Parameters 


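This is the most common case for optimization of flows. It assumes that the simulation converges
to a steady state. We denote the final state, reached after k iterations, as $x^*$ to emphasize the
fixed point nature of the iteration.

$\mathcal{P} : \mathbb{R}^p \to \mathbb{R}^m$, $\quad x^0 = \mathcal{P}(\gamma)$
$\mathcal{F} : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}^m$, $\quad x^* = \mathcal{F}(x^0, \gamma) = f^k(x^{k-1}, \gamma) \circ \dots \circ f^2(x^1, \gamma) \circ f^1(x^0, \gamma)$
$\mathcal{J} : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$, $\quad y = \mathcal{J}(x^*, \gamma)$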

For transient problems, the formulation of the parameters and cost functions can incorporate 
varying levels of unsteadiness. 


Transient Data, Steady Parameters 


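The scalar cost function depends on states from potentially all k time steps, but the parameter
set is kept constant during the time iteration. Such a formulation is e.g. useful to get a mean
value of an oscillating phenomenon, using the parameters to reduce the fluctuations.


$\mathcal{P} : \mathbb{R}^p \to \mathbb{R}^m$, $\quad x^0 = \mathcal{P}(\gamma)$
$\mathcal{F} : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}^{m \times k}$, $\quad (x^k, \dots, x^1) = \mathcal{F}(x^0, \gamma) = f^k(x^{k-1}, \gamma) \circ \dots \circ f^1(x^0, \gamma)$
$\mathcal{J} : \mathbb{R}^{m \times k} \times \mathbb{R}^p \to \mathbb{R}$, $\quad y = \mathcal{J}(x^k, \dots, x^1, \gamma)$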


Transient Data, Transient Parameters 


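As before, the scalar cost function depends on states from potentially all k time steps, but the
parameter set is allowed to change between time steps, expanding it from $\gamma \in \mathbb{R}^p$ to $\mathbb{R}^{p \times k}$.

$\mathcal{P} : \mathbb{R}^p \to \mathbb{R}^m$, $\quad x^0 = \mathcal{P}(\gamma^1)$
$\mathcal{F} : \mathbb{R}^m \times \mathbb{R}^{p \times k} \to \mathbb{R}^{m \times k}$, $\quad (x^k, \dots, x^1) = f^k(x^{k-1}, \gamma^k) \circ \dots \circ f^1(x^0, \gamma^1)$
$\mathcal{J} : \mathbb{R}^{m \times k} \times \mathbb{R}^{p \times k} \to \mathbb{R}$, $\quad y = \mathcal{J}(x^k, \dots, x^1, \gamma^k, \dots, \gamma^1)$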


3.1.2 Reference Cases 


In the following sections we will discuss the basic concept of AD in the context of CFD solvers. 
To illustrate the implementation of those concepts, they are introduced to the OpenFOAM CFD 
framework. Whenever we need a practical example we will refer to the following two test cases, 
which are well suited for topology optimization. 


Angled Duct 


The first test case is a 2D geometry of a 90 degree bend, depicted in Figure 3.1. The geometry is
stated dimensionless with a characteristic length of L.

Two pipes of diameter L and length 2L are connected to a square of edge length 3L. The flow
enters from the lower left and leaves on the top right. A constant velocity profile is prescribed at
the inlet, a zero gradient condition is applied at the outlet. The outlet pressure is fixed to zero
with a zero gradient condition at the inlet. Walls are modeled as no-slip. A vortex forms in the
upper left corner. The vortex is induced by the shear forces, created by the fluid which flows
from the lower pipe into the upward facing pipe. This vortex region is an obvious location for
optimization, aiming to reduce the power loss in the system. For higher flow velocity, a second
vortex region forms in the lower right corner. A structured mesh is created by blockMesh (see
also Section 4.6). It can thus easily be scaled to different mesh resolutions.

Figure 3.1: Geometry of the angled duct case. Inflow on the lower left, outflow on the top right.
Blocks of the structured mesh are indicated in light gray. The geometry is symmetric along
the dashed diagonal line.

At low Reynolds numbers, the use of a turbulence model is optional. For higher Reynolds 
numbers, the k-ω turbulence model is used.

Different mesh refinement levels, starting from the coarsest possible mesh for this geometry, up 
to over 200 000 cells, are listed in Table 3.1. The mesh levels below refinement level 15 are hardly 
useful to obtain a realistic solution. However, they might be used to inspect sparsity patterns and 
the discretization. The corresponding blockMeshDict.m4 file is listed in Appendix C.10. The 
mesh resolution can be controlled by a GNU m4 parameter, creating the final blockMeshDict for 
the mesher. If not otherwise specified, a refinement level of 30, resulting in $n_c = 11\,700$ cells,
is used.





Table 3.1: Refinement levels of the angled duct case. Level 1 is the coarsest possible mesh, finer
meshes are obtained by uniformly refining the block edges.

Level   n_c      Level   n_c
1       13       15      2 925
2       52       30      11 700
5       325      60      46 800
10      1 300    120     187 200
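The cell counts listed in Table 3.1 are consistent with a quadratic growth in the refinement level, $n_c = 13 \cdot \mathrm{level}^2$. This relation is an observation from the listed values, not a formula stated in the text, and the helper name below is hypothetical. A minimal sketch:

```cpp
#include <cassert>

// Hypothetical helper: cell count of the angled duct mesh for a given
// refinement level, assuming n_c = 13 * level^2 as suggested by Table 3.1.
long angledDuctCells(long level)
{
    assert(level >= 1);
    return 13 * level * level;
}
```

Under this assumption the default level of 30 gives 11 700 cells, matching the default mesh size quoted above.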
Figure 3.2: Dimensions of the Pitz-Daily example. Inflow on the left, outflow on the right.
Dimensions in mm.

Figure 3.3: Geometry of the Pitz-Daily test case, consisting of 13 blocks (bold black lines)
meshed with $n_c = 12\,250$ cells (gray lines).


Pitz-Daily Case 


The second case is the flow over a backward facing step. This problem, or variations of it, has
been extensively studied and used for verification in a variety of disciplines, in particular for
the verification of turbulence models [Rum12]. This particular geometry was first introduced
in [PD83] and is consequently referred to as the Pitz-Daily case. It is included as a standard
tutorial and verification case in OpenFOAM. The geometry is shown in Figure 3.2. The flow
enters from the left, passes the step and leaves the domain through the outlet at the end of the
nozzle shaped exit.





At the inflow a constant velocity of 10 m/s is prescribed (Dirichlet condition) with a zero
gradient condition on the pressure (Neumann condition). At the outflow a zero pressure, as
well as a zero velocity gradient condition, are applied. At the walls a no-slip condition is used.
The boundary conditions for the turbulence are calculated using the kqRWallFunction for the
turbulent kinetic energy $k$, and epsilonWallFunction for the turbulence dissipation rate $\varepsilon$. The
resulting Reynolds number of Re = 25 400, calculated with the inlet width as the reference
length, as well as the solution singularity expected at the step, makes the use of a turbulence
model advisable. Otherwise an accurate and steady solution will not be obtained on coarse meshes.
The mesh is again obtained by blockMesh, using the parameters provided by OpenFOAM. The
mesh is slightly graded towards the walls, to obtain better turbulence resolution. A planar view
of the mesh for the configuration with $n_c = 12\,250$ cells is shown in Figure 3.3.











3.1.3 Power Loss Cost Function 


For both reference cases, we utilize the power loss, induced by the pressure loss inside the system, 
as the objective. Other cost functions are possible, e.g. lift and drag are defined in Chapter 5. 
The power loss is calculated as the difference between the (total) pressure integral on the inlet
and outlet boundaries of the flow domain, multiplied by the flow velocity:

\[ J = \int_{\Gamma_{\mathrm{inlet}}} -\left( p + \frac{1}{2} |u|^2 \right) u \cdot n \, d\Gamma + \int_{\Gamma_{\mathrm{outlet}}} -\left( p + \frac{1}{2} |u|^2 \right) u \cdot n \, d\Gamma. \]

The integrals have opposite signs, as the scalar product $u \cdot n$ yields a positive flux for flow
entering the domain and negative flux for flow leaving the domain (normals are defined to be
facing inwards).
On the discretized result the integrals become sums over the boundary patches. As the flux $\phi$
is defined on the cell faces, the face fluxes are used instead of the cell centered velocities, which would need
to be interpolated to the faces first.

\[ J = \sum_{f \in \Gamma_{\mathrm{inlet}}} -\phi_f \left( p_f + \frac{1}{2} \left( \frac{\phi_f}{|S_f|} \right)^2 \right) + \sum_{f \in \Gamma_{\mathrm{outlet}}} -\phi_f \left( p_f + \frac{1}{2} \left( \frac{\phi_f}{|S_f|} \right)^2 \right). \]

As in the integral formulation, the results of the two sums have different signs, due to the
different sign of the fluxes $\phi$ for inlet and outlet faces.
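The discrete power loss can be sketched as a plain sum over boundary faces. The record type and function below are hypothetical and do not correspond to OpenFOAM data structures. The face-normal velocity is taken as the flux divided by the face area, which is an assumption about the discrete formula. The sign of the result follows from the sign convention of the supplied fluxes.

```cpp
#include <cassert>
#include <vector>

// Hypothetical boundary face record; field names are illustrative only.
struct BoundaryFace
{
    double phi;    // volumetric flux through the face
    double p;      // pressure at the face
    double magSf;  // face area
};

// Sum of -phi_f * (p_f + 0.5 * (phi_f / |S_f|)^2) over all given faces.
double powerLoss(const std::vector<BoundaryFace>& faces)
{
    double J = 0.0;
    for (const BoundaryFace& f : faces)
    {
        assert(f.magSf > 0.0);
        const double un = f.phi / f.magSf;  // assumed face-normal velocity
        J += -f.phi * (f.p + 0.5 * un * un);
    }
    return J;
}
```

Passing the inlet and outlet faces in one list reproduces the two sums. Flipping the orientation of the face normals flips the sign of every flux and therefore the sign of the returned value.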


3.2 Differentiation of Complex Simulation Software 


3.2.1 Motivation for Application of Tool Driven AD 


Software in general and especially in computational engineering can be very complex. The 
growth of complexity in software systems is believed to be exponential in time [Leh96; Dvo09],
corresponding to the exponentially growing capabilities of computing hardware, according to
Moore’s Law [Sch97]. Specifically, a general purpose tool like OpenFOAM, which is not tailored
to a very specific and narrow domain and use case, will require a very large and complex code 
base. For illustration, refer to Figure 3.4, which shows the complex linkage pattern between the 
elemental OpenFOAM libraries. If one wants to retain the scope and broad applicability of the 
primal software, then as much as possible of the primal code base has to be covered by AD. 

Table 3.2 shows the growth of the OpenFOAM code base during the last few releases. The
developers of primal OpenFOAM have recently adopted a biannual release cycle. This makes an
efficient upgrade path, which brings the adjoint version quickly up to date with the primal version,
paramount. The code base is distributed over thousands of code files, with the logic abstracted behind
many layers of class inheritance, templatization, and function macros. This, as well as the usage
of advanced C++ language features, restricts current source code transformation
tools to limited numerical kernels. In order to cover the whole package by AD,
the application of an operator overloading tool is the only viable approach at this point.

The number of changes required to the code base for the application of operator overloading
using dco/c++ is summarized in Table 3.3. The changes are broken down by the sub-libraries
contained in the OpenFOAM src/ directory. As can be expected, the number of changes roughly
correlates with the amount of code included in the specific libraries, with the OpenFOAM core library
and the finiteVolume library receiving most changes. The high number of changes in the
Pstream library stems from the addition of the AMPI library sources and not from major changes
in the existing code base. The specifics of the code changes required are addressed in Section 3.2.3.
Figure 3.4: Library dependency within the OpenFOAM framework of the simpleFoam solver
(blue node). Blue edges indicate libraries directly linked to the solver, gray edges linkages
within the OpenFOAM framework.


Table 3.2: Growth of OpenFOAM code base from version 3.0 to 17.06-plus. src/ contains
the OpenFOAM libraries, applications/ contains the individual tools and solvers (e.g.
simpleFoam).

                          src/            applications/
Version     Language      files   LOC     files   LOC
3.0         C++           2878    439k    884     133k
            C++ Header    3118    201k    908     47k
3.0-plus    C++           2995    457k    909     141k
            C++ Header    3218    208k    949     48k
16.12-plus  C++           3188    498k    879     132k
            C++ Header    3423    225k    971     47k
17.06-plus  C++           3281    515k    895     134k
            C++ Header    3529    231k    1023    48k




Table 3.3: Comparison of number of files and lines of code (LOC) between stock OpenFOAM
and discrete adjoint OpenFOAM.

Library Name             Identical Files  Changed Files  Identical LOC  Changed LOC
OpenFOAM                 1501             102            173596         768
finiteVolume             1103             12             81077          300
meshTools                389              7              61060          71
lagrangian               673              22             58431          61
dynamicMesh              219              7              98421          iy
thermophysicalModels     5900             8              59950          30
mesh                     148              4              37218          5
TurbulenceModels         236              19             28052          67
sampling                 157              5              27274          30
functionObjects          254              5              26490          11
conversion               68               1              14763          1
regionModels             146              5              13845          8
surfMesh                 95               1              9609           2
parallel                 49               4              9287           10
fvOptions                98               2              7900           4
fileFormats              83               3              7100           11
fvMotionSolver           (e               1              6227           3
edgeMesh                 39               0              5463           0
rigidBodyDynamics        94               0              4773           0
sixDoFRigidBodyMotion    D3               0              3684           0
triSurface               31               2              3026           10
OSspecific               34               0              3350           0
combustionModels         D3               4              S001           9
ODE                      35               1              2974           3
transportModels          44               1              2720           1
randomProcesses          28               2              2221           2
dynamicFvMesh            17               it             1802           1
renumber                 19               0              1595           0
engine                   25               1              1540           3
topoChangerFvMesh        15               0              1532           0
genericPatchFields       8                0              1469           0
Pstream                  2                19             1208           3259
fvAgglomerationMethods   6                0              999            0
rigidBodyMeshMotion      4                0              604            0
regionCoupled            2                0              474            0
3.2.2 Differentiating a Complete Steady Iteration History


Without assuming any additional knowledge about the intrinsics of CFD and iterative solution
methods, one can tackle the task of calculating the gradient of a cost function $J(x): \mathbb{R}^n \to \mathbb{R}$
w.r.t. the parameters $\gamma$ by algorithmically differentiating the whole computer program, which
implements the calculation of $x$ and $J$. This is commonly called the black-box approach [Nau12], as
no knowledge about the inner workings of the model is required, as long as the inputs, outputs,
and parameters of the model are well defined. The black-box approach assumes differentiability of
the implementation at all locations of interest. With FD one can calculate such a gradient even
without having access to the source code implementing the model, by perturbing the parameters
and observing the change of the outputs. This is especially useful when differentiating through
functions which are only available as pre-compiled libraries (e.g. due to licensing and intellectual
property concerns).

In the following, we assume the parameters $\gamma$ to be the momentum penalty terms $\alpha$ of topology
optimization, as this will be used in the illustrative examples. The same statements apply virtually
unchanged for the general case $\gamma \in \mathbb{R}^p$. Further, we assume the case to converge to a steady
solution, as introduced in Section 3.1.1. The final state of the problem is obtained by repeatedly
applying functions $f^i(x^{i-1}, \alpha): \mathbb{R}^n \times \mathbb{R}^{n_c} \to \mathbb{R}^n$ to the initial state, which models the propagation
of physical quantities in time.

For steady problems, the state converges to a fixed point $x^*$, at which the variation $\partial x / \partial t$
of the solution over time is zero, even if calculated with a transient solver. The series of functions
$f^i$ also converges to a function $f^*(x^*) = \lim_{k \to \infty} f^k(x^*)$. For black-box differentiation, these
fixed point properties are not exploited; however, they will become important when using reverse
accumulation and piggybacking in later sections.

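The gradient of the cost function $y = J(x^k)$ after $k$ iterations w.r.t. $\alpha$ is defined by the chain
rule as:

\[ \frac{dJ}{d\alpha} = \sum_{i=0}^{k} \frac{\partial J}{\partial x^k} \left( \prod_{j=i+1}^{k} \frac{\partial f^j}{\partial x^{j-1}} \right) \frac{\partial f^i}{\partial \alpha} + \frac{\partial J}{\partial \alpha}, \qquad (3.1) \]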


where $f^0 = P(\alpha)$ is the preprocessing step which creates the initial state $x^0$. For $k \to \infty$, this
gradient is the desired gradient of the fixed point:

\[ \lim_{k \to \infty} \frac{dJ(x^k)}{d\alpha} = \frac{dJ(x^*)}{d\alpha}. \]

Figure 3.5 shows the DAG calculating $y$ from $x^0$ and $\alpha$ by applying iteration steps $f^1$, $f^2$,
and $f^3$. This iteration will more than likely not fully converge to a fixed point; however, the
derivatives of this partially converged state can still be evaluated according to Equation (3.1).
The same expression can be constructed from the DAG by summing up the products of
partial derivatives along all paths in the DAG which connect $\alpha$ to $y$. The summation on the right
of Figure 3.5 illustrates this procedure for the given three iteration example. The sum is written
to illustrate the pathwise summation; it could be calculated with fewer operations by
using distributivity.
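The equivalence of the pathwise summation and its factored form can be checked on a scalar model problem. The model used here is an assumption chosen for illustration only: $x^i = c\,x^{i-1} + \alpha$ with $x^0 = P(\alpha) = \alpha$ and $J(x) = x^2$, so that all partial derivatives are known in closed form. Both function names are hypothetical.

```cpp
#include <cassert>

// Scalar model (assumed for illustration):
//   x^i = f(x^{i-1}, a) = c * x^{i-1} + a,   x^0 = P(a) = a,   J(x) = x^2.
// Partials: df/dx = c, df/da = 1, dP/da = 1, dJ/dx = 2x, explicit dJ/da = 0.

// Gradient by explicit summation over all paths from a to y.
double gradientPathSum(double a, double c, int k)
{
    assert(k >= 0);
    double x = a;                                  // x^0 = P(a)
    for (int i = 1; i <= k; ++i) x = c * x + a;    // primal iteration
    const double dJdx = 2.0 * x;                   // dJ/dx^k

    double sum = 0.0;
    for (int i = 0; i <= k; ++i)
    {
        double prod = 1.0;                         // prod of df^j/dx^{j-1}, j = i+1..k
        for (int j = i + 1; j <= k; ++j) prod *= c;
        sum += dJdx * prod * 1.0;                  // times df^i/da = 1
    }
    return sum;
}

// Same gradient with the factored recurrence (distributivity applied).
double gradientRecurrence(double a, double c, int k)
{
    assert(k >= 0);
    double x = a;
    double dx = 1.0;                               // dx^0/da
    for (int i = 1; i <= k; ++i)
    {
        dx = c * dx + 1.0;                         // dx^i/da
        x = c * x + a;
    }
    return 2.0 * x * dx;
}
```

For $|c| < 1$ the model has the fixed point $x^* = \alpha/(1-c)$, so both functions approach $2\alpha/(1-c)^2$ as $k$ grows. The path sum needs a number of multiplications quadratic in $k$, the recurrence only a linear number.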

For the converged solution $x^*$ of a steady case, the value of the final state does not depend on
the chosen starting point $x^0$ anymore. (At least in the sense that $dy/dx^0 = 0$. For non-convex
spaces, the solution might converge to a local minimum, the location of which depends on the
starting point $x^0$. In the following we assume that a unique solution exists and that it can be
found by solving the Navier-Stokes equations iteratively from an arbitrary starting point $x^0$.)

Figure 3.5: DAG of a three step iteration. The derivative $dJ/d\alpha$ can be determined from the
graph by multiplying and summing over paths from $\alpha$ to $y$.

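The vanishing influence of $x^0$ can be derived from the following argument: If the series of
functions $f^i$ is contractive (it converges to a fixed point), then the norm of the individual
Jacobians $\partial f^i / \partial x^{i-1}$ is lower than one for any matrix norm: $\| \partial f^i / \partial x^{i-1} \| < 1$. The norm of
the full derivative accumulated by the chain rule is therefore zero in the limit case:

\[ \lim_{k \to \infty} \left\| \prod_{i=1}^{k} \frac{\partial f^i}{\partial x^{i-1}} \right\| \le \lim_{k \to \infty} \Big( \underbrace{\max_i \left\| \frac{\partial f^i}{\partial x^{i-1}} \right\|}_{<1} \Big)^{k} = 0. \]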


The limit still holds if not every function is contractive, but the number of non-contractive 
functions is finite. 

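With the same argument it can be shown that $dJ/d\alpha$ converges to a fixed point. Assuming
convergence of the primal state, both terms $\partial J / \partial x^k$ and $\partial f^i / \partial \alpha$ in (3.1) are bounded. Let us
denote these bounded terms as $c_x$ and $c_\alpha$. Then $dJ/d\alpha$ is bounded by

\[ \left\| \frac{dJ}{d\alpha} \right\| \le c_x c_\alpha \sum_{i=0}^{k} \prod_{j=i+1}^{k} \left\| \frac{\partial f^j}{\partial x^{j-1}} \right\| + \left\| \frac{\partial J}{\partial \alpha} \right\|. \]

With $\max_i \| \partial f^i / \partial x^{i-1} \| < 1$ the sum is bounded by a geometric series, which is known to
converge to a finite limit for $k \to \infty$.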




Figure 3.6: Iteration history of the primal and adjoint residuals for the fully converged angled
duct case, evaluated over 1 000 iteration steps. (Plot: normalized residual on a logarithmic axis
over the iteration count; curves: residual of $p$, residual of $\bar{x}$, change in sensitivity.)





The independence of the starting point implies that any errors made in the early stages of
the procedure do not influence the outcome of the simulation $x^*$, as long as the iteration scheme
is robust enough to still converge to the correct solution $x^*$ for the perturbed trajectory. The
adjoints will accumulate some errors, due to the addition of the partial derivatives along all paths
from $\alpha$ to $y$; however, those errors are minor, due to $dy/dx^i \ll 1$ for the early stages of iteration.
This saves some iteration time, by allowing the residuals of the inner linear systems to be
higher for iterations which are still distant from the solution, therefore needing fewer linear solver
iterations. In OpenFOAM this concept is called relative tolerance. Relative tolerance specifies
that instead of solving to a specified absolute tolerance, the equations are solved such that the
final residual is reduced by a specified factor from the initial residual. The initial residual will
shrink as the outer iteration progresses towards the solution $x^*$.
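The stopping rule described above can be sketched as a small predicate. The function and parameter names are hypothetical. The rule of accepting either the absolute or the relative criterion is an assumption about how the two tolerances are combined, not a transcription of the OpenFOAM source.

```cpp
#include <cassert>

// Hypothetical stopping test for an inner linear solver: stop when either the
// absolute tolerance is met, or the residual has dropped by the factor relTol
// relative to the residual at the start of this linear solve.
// relTol = 0 disables the relative criterion.
bool linearSolverConverged(double residual, double initialResidual,
                           double tolerance, double relTol)
{
    assert(residual >= 0.0 && initialResidual >= 0.0);
    const bool absoluteOk = residual < tolerance;
    const bool relativeOk = relTol > 0.0 && residual < relTol * initialResidual;
    return absoluteOk || relativeOk;
}
```

With a relative tolerance of 0.05, an outer iteration that starts its linear solve at a residual of 0.1 stops at 0.005, long before a tight absolute tolerance such as 1e-8 would be reached.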

The evolution of the adjoint $\bar{x}$ can be used as a convergence criterion, which indicates that the
adjoint $\bar{\alpha}$ has converged (alternatively, the absolute change in $\bar{\alpha}$ can be observed directly).
The adjoint $\bar{x}$ holds the derivative $\partial J / \partial x^i$ as the adjoint propagation steps backward through
the iteration loop from $i = k$ to $i = 1$.

Figure 3.6 shows how $dJ/dx^i$ and the increments to $\bar{\alpha}$ shrink as the interpretation progresses
backwards through the iteration history. For reference, the residual of the (forward) pressure
equation is shown as well. The change in sensitivity after 700 iterations occasionally falls below
the output precision and is rounded to zero, breaking the logarithmic scale of the figure. Thus,
this curve is omitted after 700 iterations. The results were obtained on the angled duct case
shown in Section 3.1.2.

In Figure 3.7 the iteration procedure is stopped with only partially converged primals. When
stopping the iteration after only 50 iterations, the residual of $\bar{x}$ drops only by a factor of 20,
indicating that the solution is not yet sufficiently independent of the starting value. This results
in an offset in the adjoints compared to the fully converged case.
Figure 3.7: Iteration history of the normalized adjoint residual $\bar{x}$ (left) and the corresponding
sum of sensitivities (right). Early termination of the primal iteration leads to high adjoint
residual and wrong adjoint sensitivities.


3.2.3 Introduction of AD into the OpenFOAM Code Base 


In order to implement AD by operator overloading, all floating point variables which lie on a path
in the computational graph between inputs and outputs of the program have to be replaced with
an instance of a different data type that tracks the derivatives as well as the primals
(see Section 2.7.2). For our goal of differentiating a big software package such as OpenFOAM, it
is hard to know in advance if a variable will influence the derivative of the desired output variables
in any way. Furthermore, OpenFOAM, despite being a highly sophisticated C++ code that heavily
uses templating, is not designed for multiple scalar types to coexist at the same time. Thus, we
chose to exchange the type of all floating point values with a dco/c++ data type. OpenFOAM allows
choosing the floating point data type between double and float. This is enabled by the following
central typedef in src/OpenFOAM/primitives/Scalar/doubleScalar/doubleScalar.H, which
allows exchanging the data type for all floating point variables.








typedef double doubleScalar; 


At this location we can insert the dco/c++ data type, for example for first order adjoint mode: 


typedef dco::ga1s<double>::type doubleScalar;


and for first order tangent mode: 


typedef dco::gt1s<double>::type doubleScalar;


The definitions for higher order data types are listed in Section 3.6. 
In addition to the type change, certain common modifications to the code are required to
circumvent some limitations of the C++ language. One occurrence is the implicit casting of floating
point values to integer types. Though considered to be bad style, this is permitted by the C++
standard and is used in the OpenFOAM code occasionally. There is no default implicit cast from
the non-primitive dco/c++ data types to primitive types, and dco/c++ also does not provide them
by copy constructors or assignment operators, due to the potential of inadvertent (derivative)
data loss. Thus, the passive floating point value needs to be extracted from the dco/c++ data
type manually first, and can subsequently be implicitly or explicitly cast to an integer.


dco::ga1s<double>::type x = 42;
// fails: int i = x;
int i = static_cast<int>( dco::passive_value(x) );


The C++ implicit conversion rules allow for the following conversions in that order:

1. zero or one standard conversion sequence,
2. zero or one user-defined conversion,
3. and zero or one standard conversion sequence.


This can become problematic when the chain of type conversions becomes elongated by
the additional complexity of the dco/c++ data types. For example, the conversion chain
double -> Foam::scalarField -> Foam::dimensionedScalarField is permissible. However,
the chain double -> scalar -> Foam::scalarField -> Foam::dimensionedScalarField is
not, because the user-defined conversion is already used up by the implicit conversion from the
double literal to the (dco/c++) scalar type and is not available for the conversion from scalarField
to dimensionedScalarField anymore. Thus, the compiler has to be explicitly told that the
double literal is supposed to be a scalar. This problem, and its solution, is also illustrated in the
following listing. One can construct an instance of class B from a primitive double data type
with implicit conversion double -> A -> B; however, the conversion double -> ScalarType ->
A -> B is not possible.


class ScalarType{
public:
    double val;
    ScalarType(double val) : val(val){}
}; typedef ScalarType scalar;

class A{ public: A(scalar d){} };
class B{ public: B(A a){} };

int main(){
    // works: B(42.0); with scalar=double
    // fails: B(42.0); with scalar=ScalarType
    B(ScalarType(42.0)); // correct for both types
}


Listing 3.1: Example code for failing implicit type conversion. The literal 42.0 has to be passed 
as ScalarType to the constructor of B explicitly. 
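The limit of one user-defined conversion per implicit conversion sequence can also be checked at compile time with standard type traits. The sketch below uses hypothetical stand-in classes (not the dco/c++ or OpenFOAM types) and shows both variants side by side, one built on double and one built on a class type.

```cpp
#include <type_traits>

// Stand-in for an active AD scalar (hypothetical, not the dco/c++ type).
struct ScalarType
{
    double val;
    ScalarType(double v) : val(v) {}
};

// Variant 1: classes built on the primitive type (scalar = double).
struct A1 { A1(double) {} };
struct B1 { B1(A1) {} };

// Variant 2: classes built on the class type (scalar = ScalarType).
struct A2 { A2(ScalarType) {} };
struct B2 { B2(A2) {} };

// double -> A1 needs one user-defined conversion, so B1(42.0) compiles.
static_assert(std::is_constructible<B1, double>::value,
              "B1(42.0) is valid");

// double -> ScalarType -> A2 would need two user-defined conversions for the
// constructor argument, so B2(42.0) is rejected.
static_assert(!std::is_constructible<B2, double>::value,
              "B2(42.0) is invalid");

// Passing the scalar type explicitly leaves only ScalarType -> A2 implicit.
static_assert(std::is_constructible<B2, ScalarType>::value,
              "B2(ScalarType(42.0)) is valid");
```

Such assertions could serve as a cheap regression check when porting code to an overloaded scalar type, since a violated conversion chain then fails with a readable message at the assertion.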
In our code base the transformation of implicit to explicit conversions was done by hand.
There are tools which strive to identify and fix those issues by automatic source code
transformation [HUB16]. However, the result should still be checked manually, such that no inadvertent
loss of derivative information occurs. When developing new software with AD in mind, explicit
type conversions should be used as much as possible, to avoid type ambiguity and overlong type
conversion chains.


Non-Differentiable Functions within OpenFOAM 


The only function encountered within the OpenFOAM code base which triggers a floating point
exception, due to evaluation at a non-differentiable point, is the square root function at location
zero. Evaluating the square root function at location zero leads to a division by zero when
evaluating the derivative:

\[ y = \sqrt{x} \quad \Rightarrow \quad \frac{\partial y}{\partial x} = \frac{1}{2\sqrt{x}}. \]

This case is not commonly triggered, but occurs e.g. when calculating the $L_2$ norm of a
vector of size zero. To sidestep this problem, we replace the derivative of the square root function
by a version which adds a small $\varepsilon > 0$ to the independent when the non-differentiable point of
the square root is hit:

\[ \frac{\partial y}{\partial x} = \begin{cases} \dfrac{1}{2\sqrt{x}} & x > 0, \\[1ex] \dfrac{1}{2\sqrt{x + \varepsilon}} & x = 0. \end{cases} \]

Thus, at $x = 0$ the derivative evaluates to a very high number, mimicking the behavior of FD
and approximating the right-sided limit of the derivative of the square root function:

\[ \lim_{x \to 0^+} \frac{1}{2\sqrt{x}} = \infty. \]

Similar approaches are also taken by the primal OpenFOAM implementation to sidestep issues in
the evaluation of turbulence functions. We have not observed any influence of the choice of $\varepsilon$
on the final sensitivities, leading us to believe that either the adjoints get propagated only into branches
of the DAG which are not actually connected to the parameters, or that at least one of the partial
derivatives on the path connecting the parameters to the output and containing the calculation of
$\sqrt{0}$ is zero.
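The guarded derivative can be written as a small free function. This is a minimal sketch with a hypothetical function name. The value of eps is left to the caller, since the text does not state which value is used.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: derivative of sqrt(x) with the guard described above.
// eps is an assumed small positive constant supplied by the caller.
double sqrtDerivative(double x, double eps)
{
    assert(x >= 0.0 && eps > 0.0);
    if (x > 0.0)
    {
        return 0.5 / std::sqrt(x);          // regular branch
    }
    return 0.5 / std::sqrt(x + eps);        // x == 0: large but finite value
}
```

For eps = 1e-12 the guarded branch returns a value of the order of 5e5 instead of raising a floating point exception.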


3.2.4 Black-Box Differentiation of simpleFoam Solver 





By introducing AD to the computational kernels of OpenFOAM, adjoint solvers can be developed 
on a high abstraction level. No particular insight into the low level implementation is required. 





As an example the implementation of adjointSimpleFoam will be presented. This solver is
based on the regular OpenFOAM steady, incompressible, simpleFoam solver. It enables the
calculation of gradients w.r.t. the design parameters $\alpha$ required for topology optimization. The
simpleFoam solver implements the SIMPLE algorithm, as presented in Section 2.3. Further, it also
implements the faster converging SIMPLEC [VR84] algorithm, which can be enabled optionally.
In the following a brief overview of the passive solver is given, so that the changes required for
the adjoint versions become obvious.
The simpleFoam solver implements the iterations of the SIMPLE algorithm using a while loop,
which will run until the maximum allowed iterations have been performed or previously specified
convergence criteria have been met. Inside the loop body, the equation systems linearizing the
Navier-Stokes equations are assembled and solved. The momentum equations are implemented
in the external file UEqn.H, the pressure correction equations in pEqn.H. Those files are inlined
by the C preprocessor into the loop body at compile time (include directives inside function
bodies are rarely used in C/C++ code, but commonly occur within the OpenFOAM code).


int main(){
    #include "createMesh.H"
    #include "createFields.H"

    while (simple.loop()){
        // --- Pressure-velocity SIMPLE corrector
        #include "UEqn.H"
        #include "pEqn.H"

        laminarTransport.correct();
        turbulence->correct();

        runTime.write();
    }
}


Listing 3.2: Implementation of the main iteration loop of simpleFoam. 


The implementation of the momentum equations (Listing 3.3) showcases the high abstraction
level and object oriented design of OpenFOAM. The differential operators occurring in the
Navier-Stokes equations (compare to Section 2.1) are directly visible in the assembly of the
system matrix UEqn. The divergence operator $\nabla \cdot (\phi u)$ is discretized by fvm::div(phi, U), the
Laplacian $\nu \nabla^2 u$ by fvm::laplacian(nu, U).

The right hand side, consisting of the pressure gradient $\nabla p$, is built by -fvc::grad(p). Note
that this is a simplified implementation for laminar flows. The actual implementation includes
additional terms to model turbulence, which we omit here for clarity.


fvVectorMatrix UEqn
(
    fvm::div(phi, U)
  - fvm::laplacian(nu, U)
 ==
  - fvc::grad(p)
);
UEqn().relax();
fvOptions.constrain(UEqn());
UEqn.solve();
fvOptions.correct(U);


Listing 3.3: Implementation of the momentum equations in UEqn.H. 


The pressure correction equations, as well as the momentum correction, are implemented in pEqn.H,
shown in Listing 3.4. The notation is not as intuitive as for the momentum equations; however,
the general structure of the pressure correction equation is still visible in Line 11 of the listing
(the assembly of pEqn).


volScalarField rAU(1.0/UEqn().A());
volVectorField HbyA("HbyA", U);
HbyA = rAU*UEqn().H();

surfaceScalarField phiHbyA("phiHbyA", fvc::interpolate(HbyA) & mesh.Sf());
MRF.makeRelative(phiHbyA);
adjustPhi(phiHbyA, U, p);

fvScalarMatrix pEqn
(
    fvm::laplacian(rAU, p) == fvc::div(phiHbyA)
);

pEqn.setReference(pRefCell, pRefValue);
pEqn.solve();

phi = phiHbyA - pEqn.flux();

p.relax();

// Momentum corrector
U = HbyA - rAtU()*fvc::grad(p);
U.correctBoundaryConditions();
fvOptions.correct(U);


Listing 3.4: Implementation of the pressure correction equation in pEqn.H. 


The penalty term for the topology design parameters is introduced to the solver by adding
the component-wise product of $\alpha$ with the velocities U as a source to the momentum
equations. This is implemented by modifying the entries on the diagonal of the system matrix
with fvm::Sp(alpha, U) (Listing 3.5). The function fvm::Sp accepts an implicit source term
with positive contribution; thus the introduction of $\alpha$ does not lower the diagonal dominance of the
system matrix. Alternatively, the resistance term could also be implemented by adding the term to
the right hand side of the equation system as an explicit source term with -fvc::Su(alpha, U).


fvVectorMatrix UEqn
(
    fvm::div(phi, U)
  - fvm::laplacian(nu, U)
  + fvm::Sp(alpha, U)
 ==
  - fvc::grad(p)
);


Listing 3.5: Implementation of the source term in adjointSimpleFoam. 


For the black-box differentiation through the whole SIMPLE iteration history, no additional
changes to the main iteration loop are required. Next, the steps required to seed and obtain
the adjoints are added. Before entering the main loop, the tape data structure of dco/c++ is
initialized. The parameters $\alpha$ are registered as inputs, allowing dco/c++ to optimize the tape
by applying varied analysis [HNP05]. The tape is then recorded while executing the iteration
loop. After the loop finishes, the value of the objective $J$ is calculated. The calculation of the
objective function is implemented in a separate library, which can calculate basic objective
functions like power loss, drag, and lift. The adjoint of the objective is then seeded with one (the
1D unit vector; for parallel processing only on the root node) and the adjoint reverse propagation
process is started. After the propagation has finished, the desired gradient of the objective w.r.t.
the parameters $\alpha$ is available in the adjoints of alpha and can be extracted and written to the
sensitivity field sens.


int main(){
    #include "createMesh.H"
    #include "createFields.H"

    dco::a1s::global_tape = dco::a1s::tape_t::create();
    dco::a1s::global_tape->register_variable(alpha.begin(), alpha.end());

    while (simple.loop()){
        // Pressure-velocity SIMPLE corrector
        #include "UEqn.H"
        #include "pEqn.H"

        laminarTransport.correct();
        turbulence->correct();

        runTime.write();
    }
    scalar J = CostFunction(mesh).eval();

    if (Pstream::master())
        dco::derivative(J) = 1;
    dco::a1s::global_tape->interpret_adjoint();

    forAll(alpha, i)
        sens[i] = dco::derivative(alpha[i]);
    sens.write();
}


Listing 3.6: Main iteration loop of adjointSimpleFoam 
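The sequence of registering inputs, recording, seeding, interpreting, and extracting can be reproduced with a toy tape. The sketch below is a hypothetical miniature and not the dco/c++ interface: every name in it is made up for illustration. It supports only addition and multiplication, and replaces the SIMPLE iteration with the scalar model $x \leftarrow c\,x + \alpha$ and the objective $J = x^2$.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal reverse-mode tape (hypothetical; not the dco/c++ interface).
struct Tape
{
    struct Entry
    {
        int arg[2];         // tape indices of the arguments, -1 if unused
        double partial[2];  // local partial derivatives w.r.t. the arguments
    };
    std::vector<Entry> entries;
    std::vector<double> adjoint;

    int record(int a0, double p0, int a1, double p1)
    {
        Entry e = {{a0, a1}, {p0, p1}};
        entries.push_back(e);
        return static_cast<int>(entries.size()) - 1;
    }

    // Reset all adjoints and set the adjoint of one output to one.
    void seed(int index)
    {
        adjoint.assign(entries.size(), 0.0);
        adjoint[static_cast<std::size_t>(index)] = 1.0;
    }

    // Propagate adjoints from the last recorded entry back to the first.
    void interpretAdjoint()
    {
        for (std::size_t n = entries.size(); n-- > 0;)
            for (int s = 0; s < 2; ++s)
                if (entries[n].arg[s] >= 0)
                    adjoint[static_cast<std::size_t>(entries[n].arg[s])]
                        += entries[n].partial[s] * adjoint[n];
    }
};

// Active scalar: a value plus its position on the tape.
struct Active { Tape* tape; int index; double value; };

Active registerInput(Tape& t, double v)
{
    return Active{&t, t.record(-1, 0.0, -1, 0.0), v};
}

Active operator+(const Active& a, const Active& b)
{
    return Active{a.tape, a.tape->record(a.index, 1.0, b.index, 1.0),
                  a.value + b.value};
}

Active operator*(const Active& a, const Active& b)
{
    return Active{a.tape, a.tape->record(a.index, b.value, b.index, a.value),
                  a.value * b.value};
}

Active operator*(double c, const Active& a)
{
    return Active{a.tape, a.tape->record(a.index, c, -1, 0.0), c * a.value};
}

// Record k steps of x = c*x + alpha, evaluate J = x*x, return dJ/dalpha.
double gradientByTape(double alphaValue, double c, int k)
{
    assert(k >= 0);
    Tape tape;
    Active alpha = registerInput(tape, alphaValue);  // register the parameter
    Active x = registerInput(tape, 0.0);             // initial state x^0 = 0
    for (int i = 0; i < k; ++i) x = c * x + alpha;   // recorded iteration loop
    Active J = x * x;                                // objective
    tape.seed(J.index);                              // adjoint of J set to one
    tape.interpretAdjoint();                         // reverse propagation
    return tape.adjoint[static_cast<std::size_t>(alpha.index)];
}
```

The tape grows by two entries per iteration in this toy, which mirrors the memory growth of the black-box approach with the length of the iteration history.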


Note the usage of the forAll macro in the above code, defined by OpenFOAM as a helper to 
iterate through a generic list-like data structure. It only simplifies the creation of the loop counter 
ranging from zero to the size of the list, it does not create an iterator on the supplied list (for this 
the forAllIters macro exists, however this macro is rarely used). The straightforward definition 
of the macro is given below. 





//- Loop across all elements in a list
#define forAll(list, i) \
    for (Foam::label i=0; i<(list).size(); ++i)


Listing 3.7: Definition of the forAll macro in stdFoam.H 


The use of this macro is encouraged by the OpenFOAM coding style guide [SG18], and therefore
it will appear in some of the coming listings.

The above implementation of a black-box solver is the simplest conceivable implementation
of discrete adjoints using operator overloading, without the exploitation of further knowledge
about the problem. Such optimizations will be discussed in later sections. Building on the
calculation of sensitivities, an additional optimization loop can be placed on the outside of this
code to optimize the parameters, e.g. using steepest descent.
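Such an outer loop can be sketched independently of the solver. In the following hypothetical example the callback `gradient` stands in for one complete primal and adjoint run returning the sensitivities. The fixed step size and the fixed iteration count are simplifying assumptions, and a practical implementation would add a line search and a convergence test.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical outer loop: fixed-step steepest descent on the parameters.
// `gradient` stands in for one primal + adjoint solve returning dJ/dalpha.
std::vector<double> steepestDescent(
    std::vector<double> alpha,
    const std::function<std::vector<double>(const std::vector<double>&)>& gradient,
    double stepSize,
    int maxIterations)
{
    assert(stepSize > 0.0 && maxIterations >= 0);
    for (int it = 0; it < maxIterations; ++it)
    {
        const std::vector<double> g = gradient(alpha);
        assert(g.size() == alpha.size());
        for (std::size_t i = 0; i < alpha.size(); ++i)
        {
            alpha[i] -= stepSize * g[i];    // step against the gradient
        }
    }
    return alpha;
}
```

Each call of the callback corresponds to one full recording and interpretation of the tape, so the cost of the optimization is dominated by the number of outer iterations.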


3.3 Checkpointing


The black-box differentiation of complex programs poses challenges if the amount of RAM required
to store the tape data structure outgrows the available physical memory. For iterative problems,
an effective approach to lower the memory footprint is to incorporate recomputation, trading
off lower memory usage against increased run time. A systematic approach to recomputation
techniques is known as checkpointing [GW08] and is introduced in the following.


3.3.1 Introduction to Checkpointing 


Checkpointing is a technique commonly used to lower the memory footprint needed to adjoin
complex programs. It uses the deterministic nature of computer programs, which allows restoring
the exact state of a program from the state at some earlier (or initial) point in the execution
history.

Checkpointing involves a trade-off between memory and run time. The storage of all intermediate values
needed for the reversal process is replaced by storing only selectively, and recomputing the missing
values when they are needed. To avoid having to completely restart the program to generate
those missing values, intermediate states of the program (checkpoints) are stored (either in RAM
or on disk), which allow restoring the state of the program at an intermediate step and resuming it
from there. For the common occurrence of an iterative computation embedded inside a main
loop, it is practical to record only a small number of loop steps, adjoin them, and then resume
the program from an earlier state in order to record the missing loop steps. This process can be
repeated recursively until all loop steps have been adjoined.

Figures 3.8 and 3.9 illustrate the basic checkpointing procedure for four iterations. In the first
figure only one checkpoint is placed before the first iteration, saving the initial state $x^0$. This
allows recreating the state $x^i$ of the simulation at any iteration $i$ by recalculating it from
the initial state $x^0$. A single iteration step is adjoined at a time. Thus, only the partials of one
step need to be stored at a time, instead of four. As the last iteration step is adjoined first, the
previous (third) iteration step cannot be adjoined immediately afterwards. The state $x^2$, needed
to execute the augmented primal iteration $f^3$, is not available in memory. State $x^2$ must therefore
be recomputed by restoring the only available state $x^0$ and executing (in passive mode) iterations
$f^1$ and $f^2$. Only then can iteration $f^3$ be executed in augmented forward mode. After the
required intermediate values of this iteration step are available in memory, the adjoints obtained
from adjoining the step that was adjoined first ($f^4$) can be fed back as an input to the calculation of
the adjoints for iteration step $f^3$. In the figure, this feedback of adjoints is indicated by right
facing arrows. The procedure is repeated for iteration steps two and one, requiring the states
$x^1$ and $x^0$ to be recovered in the same way. After all iteration steps have been adjoined, the data
flow reversal can be finished by adjoining the pre-processor. The approach of checkpointing only the
minimal amount of information necessary and recomputing all other information is called the recompute
all approach.

The amount of passive recomputation can be minimized by checkpointing not only the initial
state $x^0$, but also all other states. This allows the necessary forward
evaluations of the iteration steps to start immediately, without first performing passive recomputations
to advance the iteration state. For the example in Figure 3.9, those states are $x^0$, $x^1$, and $x^2$. Checkpoints
for $x^3$ and $x^4$ are not required, as they would never be restored (the post-processor is adjoined in
conjunction with the last iteration step immediately after executing the three passive forward
evaluations). The adjoints of states $x^3$ and $x^4$ are available after the first interpretation step, and
thus a checkpoint for $x^3$ is not required. The approach of storing all possible checkpoints is called the
checkpoint all approach.

Figure 3.8: Reversal of iteration history with only one checkpoint for state $x^0$. All other states
are recomputed (recompute all approach).

Figure 3.9: Reversal of iteration history with checkpoints for states $x^0$, $x^1$, and $x^2$, which allows
reversing all iterations without any extra recomputation (checkpoint all approach).





For the general case of $k$ iteration steps, the costs of adjoining the entire iteration history using
the recompute all and checkpoint all schemes are as follows. Adjoining one iteration step at a
time, the checkpoint all approach requires the augmented forward and reverse evaluation of $k$
iteration steps, as well as the (passive) forward computation of $k-1$ iteration steps to reach the
first active step. The recompute all approach needs the same $k$ active evaluations. In addition
it needs

$$(k-1) + (k-2) + \ldots + 1 = \frac{k(k-1)}{2}$$

passive iterations to recover the state from the initial state for every reverse evaluation step.
Therefore $((k-1)(k-2))/2$ additional passive iterations are needed compared to the checkpoint
all approach. It is clear that the behavior of recompute all, with a run time factor of $O(k)$
compared to the black-box evaluation, is undesirable. Checkpoint all has an attractive run time
factor of $O(1)$ compared to black-box: assuming passive and active steps have the same run time
cost, the factor is lower than two, and in practice it will be lower still. However, the number of checkpoints
which can be stored is limited by the available virtual or physical storage space. Thus, in practice
a checkpointing scheme is chosen that aims to minimize the calculation time while limiting the
number of checkpoints so that the memory constraints are still fulfilled.
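
The step counts above can be reproduced with a small counting sketch. It does not differentiate anything; it only walks through the two reversal schedules and counts passive and active steps. The names ReversalCost, recomputeAll and checkpointAll are hypothetical and chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// Passive and active step counts of one complete reversal.
struct ReversalCost
{
    std::size_t passive; // forward steps executed without recording
    std::size_t active;  // steps recorded and adjoined
};

// Recompute all: only x^0 is checkpointed. To adjoin step j (1-based),
// restore x^0 and run steps 1..j-1 passively to reach state x^{j-1}.
ReversalCost recomputeAll(std::size_t k)
{
    ReversalCost cost{0, 0};
    for (std::size_t j = k; j >= 1; --j)
    {
        cost.passive += j - 1; // passive steps f^1 .. f^{j-1}
        cost.active += 1;      // record and adjoin f^j
    }
    return cost;
}

// Checkpoint all: one passive sweep of k-1 steps stores the intermediate
// states; afterwards every step is adjoined directly from its stored input.
ReversalCost checkpointAll(std::size_t k)
{
    return ReversalCost{k > 0 ? k - 1 : 0, k};
}
```

For the four iterations of Figures 3.8 and 3.9 this gives six passive steps for recompute all and three for checkpoint all, a difference of $(3 \cdot 2)/2 = 3$.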

The question of how to optimally place a fixed number of checkpoints in an evolution of arbitrary
function calls with known cost is a combinatorial NP-hard problem [Nau08]. A special case is
the application to iterative methods, where checkpoints can be placed between each iteration,
each iteration is assumed to have equal cost in terms of run time and memory (or at least the
cost is quantifiable a priori), and where the number of iterations performed is known beforehand.
For those assumptions, provably optimal spacings can be given without solving an optimization
problem first. Such a provably optimal spacing is given by the revolve algorithm [GW00; GW08].

For the reversal of the SIMPLE algorithm in discrete adjoint OpenFOAM, we first considered
equidistant checkpointing. Intermediate steps of the problem are stored at a fixed distance
and the checkpoint locations remain constant for the whole program execution. This simplifies
the implementation significantly, but ignores the possibility of reusing checkpoints once they
are no longer needed because all iteration states reachable from the checkpoint have already
been adjoined.
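
As an illustration of equidistant checkpointing, the sketch below places checkpoints at a fixed spacing and counts the passive steps of a reversal. It assumes that every step is adjoined on its own and that each reverse step restarts from the nearest checkpoint at or below it; this restart policy, the function names, and the omission of the initial forward sweep that writes the checkpoints are simplifying assumptions, not a description of the package.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Checkpoint positions 0, d, 2d, ... for n steps and a fixed spacing d.
std::vector<std::size_t> equidistantCheckpoints(std::size_t n, std::size_t d)
{
    assert(d > 0);
    std::vector<std::size_t> positions;
    for (std::size_t s = 0; s < n; s += d)
    {
        positions.push_back(s);
    }
    return positions;
}

// Passive steps needed when the step with input state x^j (0-based j) is
// adjoined on its own, restarting from the checkpoint at floor(j/d)*d.
std::size_t equidistantRecomputations(std::size_t n, std::size_t d)
{
    assert(d > 0);
    std::size_t passive = 0;
    for (std::size_t j = n; j-- > 0;)
    {
        passive += j % d; // distance from the checkpoint to state x^j
    }
    return passive;
}
```

With a spacing of one the count drops to zero (checkpoint all), and with a spacing equal to the number of steps it reproduces the $k(k-1)/2$ passive steps of recompute all.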





The discrete adjoint OpenFOAM package also implements binomial checkpointing in the form
of the revolve algorithm. The revolve algorithm places checkpoints with a logarithmic spacing, making
sure that checkpoints are spaced more densely next to the current interpretation step, minimizing
the number of recomputation steps. Checkpoint locations are reused once they are no longer needed at
their former position.

Figure 3.10 shows the application of the revolve algorithm to 100 iteration steps of the SIMPLE
algorithm for different numbers of checkpoints. For example, for three checkpoints, revolve places
checkpoints at iteration steps (0, 65, 92). The upper checkpoint is consecutively lowered until
iteration step 65 has been reached and the center checkpoint can be replaced at a lower location.
The checkpoint at iteration 0 is restored and is used to place the new checkpoints at (0, 38, 58).
This process is repeated until all iteration steps have been recorded and adjoined.

Figure 3.10: Calculation and reversal of 100 iteration steps with 100 (checkpoint all), five, and
three revolve checkpoints. One iteration step is recorded at a time. Horizontal lines indicate
the evolution of the checkpoint positions during the reversal procedure.

By default, all checkpointing is done on the outer level of the iteration loop; therefore at least
one complete iteration step has to fit into RAM. With some implementation effort, this can be
transformed to capture checkpoints in between different PDEs (e.g. for the SIMPLE algorithm to
place checkpoints between the momentum, mass conservation and turbulence equations). Further
breaking it down to inside the PDE solver level is more challenging, due to the strict scoping
of C++. The automatic placement of checkpoints at arbitrary positions in the program is currently
a development target for dco/c++. Theoretical groundwork has already been laid in [Lot16]. In
practice we have observed that problems which would require such high amounts of RAM profit
immensely from being distributed to different MPI nodes (see Chapter 5), making more granular
checkpoints rarely necessary.

As the number of checkpoints rises, the difference between equidistant and binomial checkpointing
vanishes. In the limit case, where every iteration step can be checkpointed, binomial checkpointing
naturally cannot improve over an equidistant scheme anymore. The number of required
recomputations for equidistant and revolve checkpointing over the number of placed checkpoints
is illustrated in Figure 3.11.








3.3.2 Implementation of Checkpointing in Discrete Adjoint OpenFOAM 


One requirement for the application of checkpointing is the separability of the main iteration
loop; that is, the solver has to be implemented such that the iteration procedure can be run for a
specified number of iterations from an arbitrary state.

The stock OpenFOAM solvers are not structured to easily achieve this. All iterations are placed
in a main loop and all flow fields are scoped locally to the main routine of the solver. Therefore a
wrapper around the adjointSimpleFoam solver was created, moving the flow fields into a class
and separating the individual loop iterations into a member function. This function can be called
to advance the simulation by a single iteration step. Additionally, this class implements abstract
methods from a CheckController class, which is designed to make the checkpointing interface
applicable to a variety of solvers by defining a common set of routines to be implemented by every
checkpointed solver. The separation of iteration steps is also useful for interfacing with external
optimizer packages, which require the repeated evaluation of the primal simulation or gradients.

Figure 3.11: Number of required recomputations $m$ for checkpointing 1000 iteration steps for
increasing number of checkpoints $n_{cp}$. Run time ratio between equidistant and revolve in
green on second y-axis.

The checkpointing functionality is implemented in the discrete adjoint OpenFOAM framework
by an abstract interface, from which the different checkpointing strategies are derived. The
following functionality is provided by the checkpointing interface:











Store checkpoint: Store the primal values of the required flow fields $x$ into a temporary
buffer $B_i \in \mathbb{R}^{n_x}$.

Restore checkpoint: Overwrite the primal values of the flow fields $x$ with the content of
buffer $B_i \in \mathbb{R}^{n_x}$.

Register variables: Register the variables of the state $x$ in the tape and remember the assigned
tape indices in $T \in \mathbb{N}^{n_x}$.

Store adjoints: Extract the adjoints from the tape's adjoint vector (from the locations stored
during the register variables step) and store them in a temporary buffer $T \in \mathbb{R}^{n_x}$.

Restore adjoints: Inject the stored adjoints from the buffer $T$ back into the tape at the locations
currently occupied by $x$.

Figure 3.12: Class diagram of the checkpointing interface.


Figure 3.12 shows the class structure of the checkpointing interface developed for discrete
adjoint OpenFOAM. It allows checkpointing instances of geometricField, which is a base class of
volScalarField (e.g. used for pressure), volVectorField (e.g. for velocity), surfaceScalarField
(e.g. face fluxes), and surfaceVectorField (e.g. surface normal vectors). In addition,
generic scalar data can be checkpointed. Currently the checkpointing strategies equidistant
(CheckEquidistant) and revolve (CheckRevolve) are implemented as specializations of the abstract
CheckMethod base class. For verification, a dummy class (CheckNone) which implements
black-box differentiation is implemented as well.
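
The store and restore operations for primal values can be sketched without OpenFOAM as follows. The class names CheckpointableSolver, CheckpointStore and ToySolver are hypothetical and do not reproduce the CheckController or CheckMethod interfaces of the package; the adjoint store and restore operations are omitted because they require a tape.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for a checkpointable solver (illustrative names).
class CheckpointableSolver
{
public:
    virtual ~CheckpointableSolver() = default;
    virtual void step() = 0;                               // advance one iteration
    virtual std::vector<double> state() const = 0;         // primal values x
    virtual void setState(const std::vector<double>&) = 0; // overwrite x
};

// Stores and restores primal states in buffers keyed by iteration index.
class CheckpointStore
{
public:
    void store(std::size_t iteration, const CheckpointableSolver& solver)
    {
        buffers_[iteration] = solver.state();
    }

    void restore(std::size_t iteration, CheckpointableSolver& solver) const
    {
        solver.setState(buffers_.at(iteration));
    }

private:
    std::map<std::size_t, std::vector<double>> buffers_;
};

// Toy solver: a damped fixed-point iteration x <- 0.5 * x + 1.
class ToySolver : public CheckpointableSolver
{
public:
    explicit ToySolver(double x0) : x_{x0} {}
    void step() override { x_[0] = 0.5 * x_[0] + 1.0; }
    std::vector<double> state() const override { return x_; }
    void setState(const std::vector<double>& x) override { x_ = x; }

private:
    std::vector<double> x_;
};
```

Because the solver only exposes a single-step function and its state, the same store can drive any solver that implements the small interface, which mirrors the motivation given for the common set of routines above.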

The memory size occupied by $n_{cp}$ checkpoints is

$$M_{cp} = \underbrace{(n_{cp} + 1) \cdot n_x \cdot \mathrm{sizeof(double)}}_{n_{cp}\ \text{checkpoints + one adjoint buffer}} + \underbrace{n_x \cdot \mathrm{sizeof(Foam::label)}}_{\text{tape index store}}, \tag{3.2}$$

where Foam::label is a typedef for the data type used for integer data. Assuming that 64 bit
double and long types are used, the memory requirement becomes

$$M_{cp} = (n_{cp} + 2) \cdot n_x \cdot 8\ \text{bytes}.$$

The size of the tape index storage can be reduced from $O(n_x)$ to $O(1)$ by using the fact that all
variables of a field are registered consecutively, and thus the tape indices of a field are adjacent
and deterministic.
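
Equation (3.2) translates directly into a small helper. In this sketch the alias label stands in for Foam::label and is assumed to be a 64 bit integer; the function name checkpointMemoryBytes is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for Foam::label, assumed here to be a 64 bit integer.
using label = std::int64_t;

// Memory in bytes according to Equation (3.2): n_cp checkpoints plus one
// adjoint buffer of doubles, plus one tape index per state variable.
std::size_t checkpointMemoryBytes(std::size_t nCheckpoints, std::size_t nState)
{
    return (nCheckpoints + 1) * nState * sizeof(double)
         + nState * sizeof(label);
}
```

With 8 byte doubles and labels this reduces to $(n_{cp} + 2) \cdot n_x \cdot 8$ bytes, as stated above.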

Figure 3.13 shows the run time of the adjointSimpleCheckpointingFoam solver for adjoining
all iteration steps of the SIMPLE algorithm on the angled duct example. Observed are two, five,
and 100 checkpoints and iteration ranges from [0,1] to [0,100]. The curves for 100 checkpoints,
which correspond to a checkpoint all approach, show that the run time does not increase linearly.
This is due to the decreasing cost of iteration steps close to the converged solution, as the embedded
linear solvers need fewer iterations to converge to their tolerance limit.

In OpenFOAM, the ratio between tape cost and checkpoint cost is usually very high, making it
feasible to store a high number of checkpoints. This is demonstrated by Table 3.4 and Figure 3.14.

Figure 3.13: Run time over number of calculated iterations of adjointSimpleCheckpointingFoam
for varying density of checkpoints. Dashed lines indicate symbolically differentiated linear
solvers.

Table 3.4: Checkpoint size and average tape size with standard deviation σ for one SIMPLE
iteration step.

Case                     Checkpoint size   Tape size SDLS          Tape size black-box
                         (MB)              Avg. (MB)   σ (MB)      Avg. (MB)   σ (MB)
Pitz-Daily               1.74              442.55      0.57        2282.19     167.127
Angled duct (Lvl. 6)     1.11              232.08      0.19        990.05      134.39
Angled duct (Lvl. 12)    4.36              923.51      0.81        3811.86     546.49

Figure 3.14: Average tape size (left y-axis) and checkpoint size (right y-axis, blue curve) for the
angled duct case of different mesh resolutions.


The figure shows both the checkpoint and average tape size for 20 SIMPLE steps of the angled
duct example, for varying mesh densities. The coarsest possible configuration for this case consists
of 325 cells. From this configuration, 20 different levels of mesh refinement are created, ranging
up to 130 000 cells.

The checkpoint size grows linearly with $n_c$, as predicted by Equation (3.2). With symbolically
differentiated linear solvers (SDLS, see Section 3.4), the tape size also grows linearly. The tape
size largely remains constant over all iterations of the SIMPLE algorithm. Using black-box
differentiation of the linear solvers, the tape size becomes less predictable, due to the varying
number of inner linear solver iterations. Thus, in the figure we also give error bars and the
standard deviation of the tape sizes. On average, the tape size still grows roughly linearly with $n_c$,
albeit with a much bigger factor.

The table includes the angled duct case for refinement levels 6 (11 700 cells) and 12 (46 800
cells). For both configurations, the ratio between tape size (utilizing SDLS) for one SIMPLE
iteration step and one checkpoint is roughly 210. For the Pitz-Daily case with 12 225 cells, the
ratio is even higher, at roughly 250. This is due to the additional complexity introduced by
the differentiation of the k-ε turbulence model.
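
As a quick check of the quoted ratios, dividing the average SDLS tape sizes by the checkpoint sizes from Table 3.4 gives approximately

$$\frac{232.08}{1.11} \approx 209, \qquad \frac{923.51}{4.36} \approx 212, \qquad \frac{442.55}{1.74} \approx 254.$$

Read the other way around, the memory needed to record a single SIMPLE step with SDLS would hold on the order of 200 checkpoints for these cases.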

Thus, binomial checkpointing only reaches its full potential for cases with many iteration steps
that cannot be covered well by equidistant checkpoints. Such cases are generated, e.g., by transient
flow simulations with high Reynolds numbers, which require a very small time step $\Delta t$ to converge
reliably.


3.3.3 Verification of the Checkpointing Implementation


One requisite for a successful application of checkpointing is that all variables belonging to the
states of the iteration loop are correctly identified. All variables which

• are overwritten inside of the iteration,
• are scoped outside of the iteration,
• and have an influence on the final outcome of the calculation

need to be included in the checkpoint. The obvious first step to check the correctness of
the checkpointing implementation is to compare the results to results obtained by black-box
differentiation. However, if the results are incorrect, further insight into the process is needed to
determine which part of the calculation is incorrectly handled.

With only minor modifications to the solver and AD tool, it can be verified whether all required state
variables have been identified or whether any are missing. Every iteration step has to be self-contained,
i.e. the operations inside of an iteration step are only allowed to depend on the primal values of
the current state, variables which are local to the current iteration, or globally defined variables
that are never overwritten (e.g. the parameters $\alpha$). Any dependency on previous iterations has
to be localized, that is, added to the state, by the checkpointing routines. This makes sure that
when resuming from a checkpoint, all required values are available (in scope) and have the right
numeric value.

The AD tool can be adapted to identify edges in the tape obtained by the black-box solver which
point to regions that will become illegal when checkpointing is applied and the order of loop
execution becomes non-consecutive. A set of legal and illegal edges in the tape is illustrated in
Figure 3.15. This technique was used to identify an issue in the checkpointing of shape adjoints,
which will be discussed in Section 4.4.
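
The edge rule can be written down as a small predicate. The sketch assumes that every tape entry carries a tag describing whether it is a global parameter, part of an iteration state, or local to an iteration, together with the index of the iteration that produced it. This tagging, and the names Region, TapeEntry, isLegalEdge and countIllegalEdges, are hypothetical and not part of dco/c++.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical classification of a tape entry; not part of dco/c++.
enum class Region { Parameter, State, Local };

struct TapeEntry
{
    Region region;         // what kind of value the entry holds
    std::size_t iteration; // iteration that produced it (unused for parameters)
};

// An edge from an entry recorded in iteration `from` to the entry `target` is
// legal if it points to a global parameter, to a value local to the same
// iteration, or to the state produced by the directly preceding iteration.
bool isLegalEdge(std::size_t from, const TapeEntry& target)
{
    switch (target.region)
    {
        case Region::Parameter: return true;
        case Region::Local:     return target.iteration == from;
        case Region::State:     return target.iteration + 1 == from;
    }
    return false;
}

// Count illegal edges in a list of (source iteration, target entry) pairs.
std::size_t countIllegalEdges(
    const std::vector<std::size_t>& sources,
    const std::vector<TapeEntry>& targets)
{
    assert(sources.size() == targets.size());
    std::size_t illegal = 0;
    for (std::size_t e = 0; e < sources.size(); ++e)
    {
        if (!isLegalEdge(sources[e], targets[e])) ++illegal;
    }
    return illegal;
}
```

A non-zero count on the black-box tape indicates a dependency that has not been added to the checkpointed state.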





Figure 3.15: Legal (green) and illegal edges (red) for checkpointing. Edges are allowed to point
to the global optimization parameters, to entries local to the iteration, and to entries of the
directly preceding state. All other edges are illegal.


3.4 Symbolic Differentiation of Embedded Linear Solvers


Typically, the run time of PDE solvers is heavily dominated by the execution of linear equation
solvers (compare to Chapter 5), which repeatedly solve the linearized and discretized differential
equations. While differentiating all linear solvers with AD technically works, it is in practice
undesirable, as the black-box differentiation of certain classes of linear solvers exhibits numerical
instabilities [Chr18]. Furthermore, it is very costly to fully differentiate those solvers, especially
in terms of memory required for the storage of the tape. For iterative solvers, the cost is strongly
correlated with the condition number of the linear system. The cost grows linearly with the
number of iterations required by the solver, which in turn grows with the condition number.

We will now outline how the adjoints of linear systems can be obtained without fully
differentiating the solution process by AD. When calculating the solution $x$ to the linear system
$Ax = b$ without applying AD techniques, the adjoints $\bar{A}$ and $\bar{b}$ are not propagated automatically.
However, using the known adjoints $\bar{x}$ of the solution $x$, the adjoints of $A$ and $b$ can be calculated
symbolically [Gil08; Nau+15].


3.4.1 Symbolic Adjoint Relations for Linear Systems 


Theorem 4 (Calculation of adjoints of RHS vector b).
For a regular matrix $A \in \mathbb{R}^{n \times n}$, vectors $x \in \mathbb{R}^n$, $b \in \mathbb{R}^n$, and $x = A^{-1} b$, the adjoints of $b$ are
given by $\bar{b} = A^{-T} \bar{x}$.


Proof. See [Gil08; Nau+15]. □


Instead of explicitly calculating the transposed inverse $A^{-T}$ of $A$, the adjoints of the right hand
side can be calculated at cost $O(n^3)$ (assuming dense matrices) by solving the equivalent linear
equation system

$$A^T \bar{b} = \bar{x}. \tag{3.3}$$

This is particularly beneficial for sparse systems, as the inverse of a sparse matrix is in general
not sparse. For the sub-case of symmetric matrices and direct linear system solvers, an existing
(LU, QR or Cholesky) factorization of $A$ from the primal solve $Ax = b$ might be reused, at the
cost of storing the possibly more dense factorization instead of $A$. This lowers the complexity
from $O(n^3)$ to $O(n^2)$ for the forward and backward substitution of the factorization.


Theorem 5 (Calculation of matrix adjoints A).
For a regular matrix $A \in \mathbb{R}^{n \times n}$, vectors $x \in \mathbb{R}^n$, $b \in \mathbb{R}^n$ and $x = A^{-1} b$, the adjoints of the
individual matrix entries of $A$ are given by the outer product $\bar{A} = -\bar{b} \otimes x^T$.


Proof. See [Gil08; Nau+15]. □


The adjoints $\bar{A}$ and $\bar{b}$ can thus be calculated by solving the additional equation system (3.3) and
evaluating an outer product. We assume that the memory locations of $A$ and $b$ are not aliased
and thus $A$ and $b$ are truly independent.

To summarize, the discrete AD differentiation of the linear system solver $x = S(A, b)$ is
replaced by the following steps:

During primal solution:

$$A \cdot x = b \quad \rightarrow \quad x := S(A, b)$$

During reverse propagation of adjoints:

$$A^T \cdot \bar{b} = \bar{x} \quad \rightarrow \quad \bar{b} := S(A^T, \bar{x}) \tag{3.4}$$

$$\bar{A} := -\bar{b} \otimes x^T. \tag{3.5}$$
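
The two reverse steps can be tried out on a small dense system. The following sketch uses a plain Gaussian elimination as the passive solver $S$; this choice, the dense storage, and the names solve, transpose, LinearAdjoints and symbolicAdjoint are assumptions for illustration and are unrelated to the OpenFOAM solvers discussed later.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>; // dense, row-major

// Passive solver S(A, b): Gaussian elimination with partial pivoting.
Vector solve(Matrix A, Vector b)
{
    const std::size_t n = b.size();
    for (std::size_t c = 0; c < n; ++c)
    {
        std::size_t p = c;
        for (std::size_t r = c + 1; r < n; ++r)
            if (std::fabs(A[r][c]) > std::fabs(A[p][c])) p = r;
        std::swap(A[c], A[p]);
        std::swap(b[c], b[p]);
        assert(A[c][c] != 0.0); // regular matrix assumed
        for (std::size_t r = c + 1; r < n; ++r)
        {
            const double f = A[r][c] / A[c][c];
            for (std::size_t k = c; k < n; ++k) A[r][k] -= f * A[c][k];
            b[r] -= f * b[c];
        }
    }
    Vector x(n);
    for (std::size_t i = n; i-- > 0;)
    {
        double s = b[i];
        for (std::size_t k = i + 1; k < n; ++k) s -= A[i][k] * x[k];
        x[i] = s / A[i][i];
    }
    return x;
}

Matrix transpose(const Matrix& A)
{
    const std::size_t n = A.size();
    Matrix At(n, Vector(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) At[j][i] = A[i][j];
    return At;
}

struct LinearAdjoints
{
    Vector bBar;
    Matrix ABar;
};

// Equations (3.4) and (3.5): bBar = S(A^T, xBar), ABar = -bBar x^T.
LinearAdjoints symbolicAdjoint(const Matrix& A, const Vector& x, const Vector& xBar)
{
    LinearAdjoints adj;
    adj.bBar = solve(transpose(A), xBar);
    const std::size_t n = x.size();
    adj.ABar.assign(n, Vector(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) adj.ABar[i][j] = -adj.bBar[i] * x[j];
    return adj;
}
```

For $A = \begin{pmatrix} 2 & 1 \\ 0 & 4 \end{pmatrix}$, $b = (4, 8)^T$ and the seed $\bar{x} = (1, 0)^T$, the solution is $x = (1, 2)^T$ and the sketch yields $\bar{b} = (0.5, -0.125)^T$, which agrees with differentiating $x_1 = b_1/2 - b_2/8$ by hand.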

In the context of CFD, the linear solvers are embedded into a non-linear (iterative) calculation.
The matrix $A$ is usually stored in a sparse representation (e.g. compressed row storage or coordinate
format). The outer vector product in Equation (3.5), forming $\bar{A}$, only needs to be applied to the
adjoints corresponding to non-zero entries. While the outer product produces a dense matrix
of adjoints, structurally zero matrix entries will by definition never influence any results; their
adjoints are thus deemed irrelevant.
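
Restricting the outer product to the sparsity pattern can be sketched for the coordinate format as follows. The struct CooPattern and the function sparseMatrixAdjoint are hypothetical names; OpenFOAM's own LDU storage is organized differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sparse matrix pattern in coordinate format: entry e sits at (row[e], col[e]).
struct CooPattern
{
    std::vector<std::size_t> row;
    std::vector<std::size_t> col;
};

// Evaluate ABar = -bBar x^T only at the stored (structurally non-zero) entries.
std::vector<double> sparseMatrixAdjoint(
    const CooPattern& pattern,
    const std::vector<double>& bBar,
    const std::vector<double>& x)
{
    assert(pattern.row.size() == pattern.col.size());
    std::vector<double> aBar(pattern.row.size());
    for (std::size_t e = 0; e < aBar.size(); ++e)
    {
        aBar[e] = -bBar[pattern.row[e]] * x[pattern.col[e]];
    }
    return aBar;
}
```

The cost of this step scales with the number of stored entries instead of $n^2$.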

Note that for the symbolic adjoint relation to hold, it is required that $x$ is actually the solution
to $A \cdot x = b$, that is $x = A^{-1} b$. When using iterative linear solvers, stricter tolerances may
be required during the primal evaluation in order to achieve the desired accuracy in the symbolic
adjoint. Furthermore, relative tolerances (where one does not prescribe an absolute
residual threshold, but a reduction of the residual in relation to the initial state by a certain
factor) should in general not be applied when differentiating the linear system symbolically. It has been
shown, however, that relative tolerances can still be applied to symbolically differentiated
iterative linear solver schemes when certain corrections are applied [AHM16].

A case study for the propagation of errors due to the residual of the primal and adjoint linear
systems is shown in Figure 3.16. Observed are 150 SIMPLE iteration steps of the angled duct case
introduced in Section 3.1.2. Linear solvers are GAMG for the pressure equation and Gauss-Seidel
for the momentum equations. The linear systems are solved with a solver tolerance of $\epsilon_f$ for the
primal solvers (for pressure and velocity) and $\epsilon_r$ for the linear solvers in the reverse propagation
phase. The errors are obtained by comparing to a reference solution, obtained with $\epsilon_f = \epsilon_r$ set to
a value at the limit of machine precision. The obtained results are not completely smooth, as the
tolerances $\epsilon_f$ and $\epsilon_r$ are only the upper limit for the linear solver. The actual tolerances achieved
by the discrete number of iteration steps of the linear solver might be lower, and do not scale
linearly with the prescribed tolerance $\epsilon$.

dco/c++ allows stopping the recording of the adjoint stack (switch to passive mode) and restarting
the recording at a later time (switch to active mode). The resulting gap in the tape has to be filled
when the adjoints are propagated from the outputs to the inputs during the adjoint propagation
phase. For this purpose, dco/c++ allows creating functions to be called at a specific point in the
adjoint propagation process, using a callback interface. Figure 3.17 shows a conceptual overview
of the solution process of one velocity-pressure correction step with the corresponding calls to
linear solvers (inside the gray boxes) and how adjoint information is propagated. The information
missing from the tape due to the gaps must be supplied by evaluating Equations (3.4) and (3.5)
inside the adjoint callback functions.

As the tape is switched off during the linear solver calls, passive versions (using regular floating 
point data types) of the linear solvers can be called, increasing performance and reducing memory 
overhead. However, for this the data of the matrix and vectors need to be copied into passive 
containers, introducing some additional code and data duplication. 
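
The interplay of a tape gap and a callback can be illustrated with a toy tape. The class ToyTape below is a hypothetical, minimal stand-in written for this sketch; it is not the dco/c++ API and only mimics the idea that a passively computed result is represented on the tape by a callback that propagates its adjoint.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy adjoint tape (hypothetical; this is NOT the dco/c++ API).
class ToyTape
{
public:
    using Callback = std::function<void(std::vector<double>&)>;

    // Register a new independent variable and return its tape index.
    std::size_t input()
    {
        entries_.push_back(Entry{});
        return entries_.size() - 1;
    }

    // Record result = f(args) through its local partial derivatives.
    std::size_t record(std::vector<std::size_t> args, std::vector<double> partials)
    {
        assert(args.size() == partials.size());
        entries_.push_back(Entry{std::move(args), std::move(partials), nullptr});
        return entries_.size() - 1;
    }

    // Record a gap: the result was computed passively, and the callback
    // propagates its adjoint to the inputs during interpretation.
    std::size_t gap(Callback callback)
    {
        entries_.push_back(Entry{{}, {}, std::move(callback)});
        return entries_.size() - 1;
    }

    // Reverse interpretation, seeded with adjoint 1 at the output.
    std::vector<double> interpret(std::size_t output) const
    {
        std::vector<double> adjoints(entries_.size(), 0.0);
        adjoints[output] = 1.0;
        for (std::size_t i = entries_.size(); i-- > 0;)
        {
            const Entry& e = entries_[i];
            if (e.callback) e.callback(adjoints);
            for (std::size_t k = 0; k < e.args.size(); ++k)
                adjoints[e.args[k]] += e.partials[k] * adjoints[i];
        }
        return adjoints;
    }

private:
    struct Entry
    {
        std::vector<std::size_t> args;
        std::vector<double> partials;
        Callback callback;
    };
    std::vector<Entry> entries_;
};

// d(3 * sqrt(a))/da, where sqrt is evaluated passively inside a tape gap.
double gradientThroughGap(double a)
{
    ToyTape tape;
    const std::size_t ia = tape.input();
    const double y = std::sqrt(a); // passive evaluation, nothing recorded
    std::size_t iy = 0;
    iy = tape.gap([&](std::vector<double>& adj) {
        adj[ia] += adj[iy] / (2.0 * y); // symbolic adjoint of sqrt
    });
    const std::size_t iz = tape.record({iy}, {3.0}); // z = 3 * y, recorded
    return tape.interpret(iz)[ia];
}
```

In the linear solver setting, the passive evaluation corresponds to the call $x := S(A, b)$ and the body of the callback to Equations (3.4) and (3.5).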

The implementation of the symbolically adjoined linear solvers in the context of a sparse
iterative CFD solver will be discussed in the following section.

Figure 3.16: Mean error between reference result and result obtained by forward linear solver
tolerance $\epsilon_f$ and reverse solver tolerance $\epsilon_r$. Differentiated are 150 SIMPLE iteration steps of
the angled duct case, starting from a solution of the potential equations.


3.4.2 Implementing SDLS in OpenFOAM 


In OpenFOAM, the assembly and solution of linear equation systems is implemented in the
finiteVolume library, using general linear equation system solvers implemented in the OpenFOAM
library. The solvers for scalar and vector fields are implemented in

• fvMatrices/fvScalarMatrix/fvScalarMatrix.C and
• fvMatrices/fvMatrix/fvMatrixSolve.C

respectively. For vector fields, the matrix (FVM discretization) coefficients are scalars and
identical for all dimensions. The coefficients differ only for the interfaces, which are not part of
the LDU coefficients but are stored separately. Interfaces represent the boundary conditions and
internal processor faces, if solving a parallel case. The equation systems are solved in a segregated
fashion, i.e. the vector components are decoupled and solved for independently.

To solve a vector equation (e.g. from the discretized momentum equation), three (for the general
3D case) scalar equations with the same matrix coefficients but different interfaces and right hand
sides are solved. The coupling between the equations is only introduced by the outer non-linear
iteration scheme and the right hand side contributions. To compute the symbolic adjoints of the
linear solvers, it is therefore sufficient to focus on the solution of scalar equations and let the AD
tool automatically differentiate the assembly of the scalar subproblem from the vector problem.

The matrix classes in the finiteVolume library pass an lduMatrix (see Section 2.6), which is
assembled from the chosen discretization schemes and boundary conditions, as well as a right
hand side vector and an initial guess (usually the result of the previous non-linear iteration), to
the linear equation system solvers defined in the OpenFOAM library.

Figure 3.17: Outline of the procedure for filling the gaps in the tape created during the
augmented primal section by executing passive versions of the linear solvers only. Panels:
(a) Augmented forward creating gaps; (b) Adjoint reverse propagation filling gaps. New outer
iteration values $U^{i+1}$ and $p^{i+1}$ are generated from $U^i$ and $p^i$. The shaded parts indicate the
scope of the linear solvers. The upward facing dashed lines indicate the part of the solution
process that is treated symbolically. Consequently this part is not stored in the tape and has
to be supplied in an adjoint callback function in the (reverse) adjoint propagation step.

Currently, the following iterative linear equation system solvers for lduMatrices are
implemented in OpenFOAM:

GAMG: Geometric Algebraic Multigrid solver with multiple choices for the smoother [Bra77],

PBiCG: Preconditioned Biconjugate gradient method [Fle76],

PBiCGStab: Preconditioned Biconjugate gradient stabilized method [Van92],

PCG: Preconditioned Conjugate Gradient method (for symmetric matrices) [HS52],

smoothSolver: Simple preconditioned Gauss-Seidel solver.


All solvers are coupled to the lduMatrix class with the same interface, and thus for the calculation
of symbolic adjoints any of the above mentioned solvers can be used for the primal and adjoint
callback linear systems. When using symbolic adjoints for the linear solvers, there is no particular
need to use the same solver for the primal and adjoint linear systems. If, e.g., the smoothSolver
is used for the primal to obtain better stability of the primal convergence, it might be possible
to use a solver with better convergence properties, like GAMG, for the adjoint.

We will highlight the major implementation points of the solver for scalar matrices implemented
as fvScalarMatrix::solveSegregated() in fvMatrices/fvScalarMatrix/fvScalarMatrix.C.
The code is simplified in that it always assumes an asymmetric matrix. For symmetric matrices,
the matrix transposal is not necessary, and not all matrix elements need to be incremented.
However, the symmetric part of the matrix has to be incremented as well, making the code more
complex. Furthermore, the code always assumes that the tape is switched on at the entry to the
function, and thus symbolic differentiation is required. The full code, without those assumptions
and with the ability to switch off symbolic differentiation on demand, is included in Appendix B.

In Listing 3.8 the code for the assembly and solution of a scalar lduMatrix is shown. First the
solution vector psi is initialized with the field of the last iteration psi_ as an initial guess. Then
the diagonal of the matrix is modified with coefficients of the boundary conditions (Line 5). Note
that the matrix coefficients are stored in the fvMatrix object, and are accessible by dereferencing
the this pointer. In order to not alter the matrix permanently, the diagonal is stored (Line 4)
and restored at the end of the function (Line 21). The right hand side vector totalSource
is assembled from the member field source_ and the influence of boundary conditions (Lines
7, 8). After those preparations, the lduMatrix solver object can be assembled from the matrix
coefficients stored in *this (Line 13), the boundary and internal coefficients (Lines 14, 15),
the scalar interfaces of the solution vector psi, and the solverControls object. The newly
constructed solver object is then used to solve the linear equation system with the right hand
side totalSource for the unknowns psi (Line 18). Finally, the solution vector psi is corrected
to comply with boundary conditions (Line 21).


Foam::solverPerformance Foam::fvMatrix<Foam::scalar>::solveSegregated(const
    dictionary& solverControls){
  auto& psi = const_cast<GeometricField<scalar, fvPatchField, volMesh>&>(psi_);
  scalarField saveDiag(diag());
  addBoundaryDiag(diag(), 0);

  scalarField totalSource(source_);
  addBoundarySource(totalSource, false);

  // Solver call
  solverPerformance solverPerf = lduMatrix::solver::New(
    psi.name(),
    *this,
    boundaryCoeffs_,
    internalCoeffs_,
    psi_.boundaryField().scalarInterfaces(),
    solverControls
  )->solve(psi.primitiveFieldRef(), totalSource);

  diag() = saveDiag;
  psi.correctBoundaryConditions();
  return solverPerf;
}


Listing 3.8: fvMatrix solve routine, can be differentiated by AD without change. 


This code is taken straight from the OpenFOAM code base and can be used without any changes
to compute the black-box derivatives of the linear solvers.

Listing 3.9 shows the modifications that were made to incorporate the symbolic differentiation of
the linear equation systems.

The gap in the tape is opened by switching off the tape in Line 10 and closed by switching it
back on in Line 43. This gap only wraps the creation and solution of the lduMatrix solver object.
However, it does not include the code which modifies the matrix and source terms according to
the boundary conditions. This initialization code is still handled by AD. The modified solver
creates a callback object, which is inserted into the tape in Line 42. It will be called during the
reverse propagation phase, when the gap in the tape is encountered.

A dco/c++ callback object can store the following: 


Callback function: Function that will be called when the interpretation of the tape reaches the 
gap. Contents of the callback object are passed as argument to this function. 


Output variables: Variables, the adjoints of which will be passed as an input to the callback 
function. 


Input variables: Variables, the adjoints of which will be calculated by the callback function. 


Data variables: Variables, the values of which are copied and made available to the callback 
function as auxiliary input variables. 


For the symbolic differentiation of the linear solver, we store the following data (Lines 34-37): 


• A string, which holds the field name (e.g. "p" for pressure), in order to look up the solver 
settings for the linear solver in the reverse section. 

• An int, which holds a direction (0,1,2), indicating which dimension should be solved for. 
A scalar field is indicated by direction -1. 

• A fvMatrix, which holds a copy of the matrix (*this) used to solve the primal equations. 

• A scalarField, which holds a copy of the solution psi, computed by the primal linear 
equation system solver. 


The adjoints of the solver output psi are required as inputs to the callback function, and are 
thus registered as output variables in Line 40. 
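
The role of the four callback ingredients can be illustrated with a toy tape. The following is a minimal sketch and not the dco/c++ API: ToyTape, ToyCallback and recordScaleGap are hypothetical names, and the "external" operation in the gap is simply y = c * x instead of a linear solve.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Toy reverse-mode tape (hypothetical, not the dco/c++ API).
struct ToyTape {
    std::vector<double> adjoint;                     // one adjoint slot per variable
    std::vector<std::function<void(ToyTape&)>> ops;  // reverse steps in recording order

    std::size_t newVariable() {
        adjoint.push_back(0.0);
        return adjoint.size() - 1;
    }

    // Reverse interpretation: replay the recorded steps back to front.
    void interpretReverse() {
        for (auto it = ops.rbegin(); it != ops.rend(); ++it) {
            (*it)(*this);
        }
    }
};

// The information carried by a callback object.
struct ToyCallback {
    std::vector<std::size_t> inputs;   // input variables: adjoints computed by the callback
    std::vector<std::size_t> outputs;  // output variables: adjoints passed to the callback
    double data;                       // data variable: checkpointed auxiliary value
};

// Passive evaluation of y = c * x. Nothing but the callback is recorded,
// which corresponds to a gap in the tape.
std::size_t recordScaleGap(ToyTape& tape, std::size_t x, double c) {
    const std::size_t y = tape.newVariable();
    const ToyCallback cb{{x}, {y}, c};
    // Callback function: executed when the reverse sweep reaches the gap.
    tape.ops.push_back([cb](ToyTape& t) {
        assert(cb.inputs.size() == 1 && cb.outputs.size() == 1);
        const double yBar = t.adjoint[cb.outputs[0]];
        t.adjoint[cb.inputs[0]] += cb.data * yBar;
    });
    return y;
}
```

Seeding the adjoint of y with 1 and calling interpretReverse() leaves the value c in the adjoint of x, which is the derivative the tape itself never recorded.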

By passing the input adjoints and the data variables to the callback object, the desired adjoints 
can be computed during the execution of the callback function. The algorithms to compute the 
symbolic derivatives are shown in Listings 3.10 and 3.11. The first listing shows the retrieval of 
the output adjoints and auxiliary variables, the assembly of the adjoint system and its solution. 
The data variables, in particular $x$ and the matrix coefficients corresponding to $A$, are restored in 
Lines 3-8. The incoming adjoints $\bar{x}$ of the solver outputs are read in Lines 11,12. The internal 
and boundary coefficients of the matrix are set in Lines 20,21. These coefficients differ for the 
different components of a vector field. The stored direction is needed here to extract the correct 
coefficients from the callback object. For a scalar field, component(i) always returns a 
reference to the scalar field, no matter which dimension is passed. Therefore, those lines 
work for both Type=scalar and Type=vector. 

Next, the matrix A is transposed (here always assumed to be necessary). For the OpenFOAM 
sparse matrix format, this can be done by swapping the upper and lower coefficient vectors, 
as well as the boundary and interior coefficient vectors (Lines 24-31). As the matrix A stored 
in the callback object is read only (and might be reused later), a copy is constructed for the 
transposed matrix. 

 1  solverPerformance fvMatrix<scalar>::solveSegregated(const dictionary& solverControls){
 2    GeometricField<scalar, fvPatchField, volMesh>& psi =
        const_cast<GeometricField<scalar, fvPatchField, volMesh>&>(psi_);
 3
 4    scalarField saveDiag(diag());
 5    addBoundaryDiag(diag(), 0);
 6
 7    scalarField totalSource(source_);
 8    addBoundarySource(totalSource, false); // assemble RHS vector
 9
10    ADmode::global_tape->switch_to_passive(); // Gap in tape starts here
11
12    ADmode::external_adjoint_object_t* D = ADmode::global_tape->
        create_callback_object();
13
14    forAll(totalSource, i)
15      D->register_input(totalSource[i]);
16    for(int i = 0; i < this->upper().size(); i++)
17      D->register_input(this->upper()[i]);
18    for(int i = 0; i < this->lower().size(); i++)
19      D->register_input(this->lower()[i]);
20    for(int i = 0; i < this->diag().size(); i++)
21      D->register_input(this->diag()[i]);
22
23    // solve (*this) * psi = totalSource
24    solverPerformance solverPerf = lduMatrix::solver::New
25    (
26      psi.name(),
27      *this,
28      boundaryCoeffs_,
29      internalCoeffs_,
30      psi_.boundaryField().scalarInterfaces(),
31      solverControls
32    )->solve(psi.primitiveFieldRef(), totalSource);
33
34    D->write_data(psi.name());
35    D->write_data(Foam::direction(-1)); // dummy direction for scalar fields
36    D->write_data(*this); // copy of lduMatrix representation
37    D->write_data(psi); // copy of solution to (*this) * psi = totalSource
38
39    forAll(psi.primitiveField(), i)
40      psi.primitiveFieldRef()[i] = D->register_output(dco::passive_value((psi.
        primitiveFieldRef()[i])));
41
42    ADmode::global_tape->insert_callback<ADmode::external_adjoint_object_t>(
        symbolic::fillSolverGap<Foam::scalar>, D);
43    ADmode::global_tape->switch_to_active(); // Gap in tape ends here
44
45    diag() = saveDiag;
46    psi.correctBoundaryConditions();
47  }

Listing 3.9: fvMatrix solve with creation of solver gap and checkpoints of the necessary data. 
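
The transposition by swapping coefficient arrays can be checked on a small example. The sketch below uses a hypothetical stand-in for the LDU storage (the struct Ldu and the helper names are assumptions, not OpenFOAM classes) and covers only the local coefficients, not the boundary and interior coefficient vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Minimal LDU storage (hypothetical stand-in for OpenFOAM's lduMatrix).
struct Ldu {
    std::vector<double> diag, lower, upper;         // coefficients
    std::vector<std::size_t> lowerAddr, upperAddr;  // cell indices of each face
};

// y = A * x with the LDU access pattern: lower[f] sits in row upperAddr[f],
// upper[f] sits in row lowerAddr[f].
std::vector<double> multiply(const Ldu& A, const std::vector<double>& x) {
    assert(x.size() == A.diag.size());
    std::vector<double> y(x.size());
    for (std::size_t c = 0; c < x.size(); ++c) {
        y[c] = A.diag[c] * x[c];
    }
    for (std::size_t f = 0; f < A.lower.size(); ++f) {
        y[A.upperAddr[f]] += A.lower[f] * x[A.lowerAddr[f]];
        y[A.lowerAddr[f]] += A.upper[f] * x[A.upperAddr[f]];
    }
    return y;
}

// Transpose on a copy, leaving the stored matrix untouched:
// only the off-diagonal coefficient arrays change places.
Ldu transposed(Ldu A) {
    std::swap(A.lower, A.upper);
    return A;
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        s += a[i] * b[i];
    }
    return s;
}
```

A quick consistency check is the identity y·(A x) = x·(Aᵀ y), which holds for any pair of vectors if transposed() really yields the transpose.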

The solver controls are read in from the system/fvSolution file, specifying the class of linear 
solver and required solution tolerance (Lines 33, 35). The adjoint linear equation system 

$A^T \bar{b} = \bar{x}$ 

is then assembled and solved in Lines 37-48. The result is available in a1_b after the solver 
has finished. 

The second listing, Listing 3.11, takes the adjoints computed from the adjoint linear equation system 
and writes them to the corresponding places in the adjoint vector of the tape. The adjoints 
of $b$ can be directly written to the tape via the registered input adjoints. For the adjoints of $A$, 
the outer product is calculated as $\bar{a}_{ij} = -\bar{b}_i \cdot x_j$ and written to the corresponding locations in 
the LDU addressing (obtained in Lines 8,9). For the diagonal part of the matrix, the index 
directly corresponds to the index of the cell, therefore no addressing is required (Lines 23-25). 
For the off-diagonal entries, the indices have to be retrieved from the LDU addressing first. The 
incrementation of the off-diagonal coefficients is implemented in Lines 12-20. 

It is important that the input adjoints are incremented in exactly the order in which they were 
registered with the callback object. Otherwise, the wrong elements of the adjoint vector are incremented. 

Here we omitted the treatment of parallel boundaries and symmetry. The former will be 
introduced in a later section, the latter is detailed in Appendix B. 
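
The two steps of the callback, solving the transposed system and forming the outer product, can be reproduced on a dense 2 x 2 system. This is a minimal sketch under stated assumptions: Cramer's rule stands in for the iterative solver, and all names (Vec2, Mat2, solve2x2, adjointOfSolve) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<Vec2, 2>;  // A[i][j] is row i, column j

// Direct 2x2 solve by Cramer's rule; stands in for the iterative solver.
Vec2 solve2x2(const Mat2& A, const Vec2& b) {
    const double det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
    assert(std::fabs(det) > 0.0);
    return {(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det};
}

Mat2 transpose(const Mat2& A) {
    Mat2 T{};
    for (std::size_t i = 0; i < 2; ++i) {
        for (std::size_t j = 0; j < 2; ++j) {
            T[i][j] = A[j][i];
        }
    }
    return T;
}

struct SolverAdjoints {
    Vec2 bBar;  // adjoint of the right-hand side b
    Mat2 ABar;  // adjoint of the matrix coefficients
};

// Given the primal solution x of A x = b and the incoming adjoint xBar,
// solve A^T bBar = xBar and form ABar_ij = -bBar_i * x_j.
SolverAdjoints adjointOfSolve(const Mat2& A, const Vec2& x, const Vec2& xBar) {
    SolverAdjoints r{};
    r.bBar = solve2x2(transpose(A), xBar);
    for (std::size_t i = 0; i < 2; ++i) {
        for (std::size_t j = 0; j < 2; ++j) {
            r.ABar[i][j] = -r.bBar[i] * x[j];
        }
    }
    return r;
}
```

With xBar = (1, 0) the returned adjoints are the derivatives of the first solution component with respect to b and A, which can be compared against finite differences of solve2x2.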
template<class Type>
inline void fillSolverGap(typename Foam::ADmode::external_adjoint_object_t *D){
    const Foam::word& fieldName = D->read_data<Foam::word>();
    const Foam::direction& cmpt = D->read_data<Foam::direction>();
    const Foam::fvMatrix<Type>& A = D->read_data<Foam::fvMatrix<Type> >();
    const Foam::volScalarField& x_ref = D->read_data<Foam::volScalarField>();
    const Foam::scalarField& x = x_ref.primitiveField();

    Foam::scalarField a1_x(x);
    forAll(a1_x, i)
        a1_x[i] = D->get_output_adjoint(); // read incoming adjoints from tape

    Foam::fvMatrix<Type> A_T(A); // will hold transpose of A

    // component() will return scalarField for Type=scalar
    Foam::FieldField<Foam::Field,Foam::scalar> bcmpts = A_T.boundaryCoeffs().
        component(cmpt)();
    Foam::FieldField<Foam::Field,Foam::scalar> icmpts = A_T.internalCoeffs().
        component(cmpt)();

    // transpose matrix by swapping coefficient arrays
    A_T.lower() = A.upper();
    A_T.upper() = A.lower();
    icmpts = A_T.boundaryCoeffs().component(cmpt)();
    bcmpts = A_T.internalCoeffs().component(cmpt)();

    // Lookup solver controls
    Foam::word reverseFieldName = fieldName + Foam::word("Reverse");
    Foam::volScalarField a1_b(reverseFieldName, x_ref);
    const dictionary& reverseSolverControls = a1_b.mesh().solverDict([...]);

    // solve A^T * a1_b = a1_x
    Foam::autoPtr<Foam::lduMatrix::solver> solver = Foam::lduMatrix::solver::New
    (
        fieldNameCmpt + Foam::word("Reverse"),
        A_T,
        bcmpts,
        icmpts,
        a1_b.boundaryField().scalarInterfaces(),
        reverseSolverControls
    );

    // solve the system
    solverPerformance solverPerf = solver->solve(a1_b.primitiveFieldRef(),
        a1_x);

    [...] // continued in next listing

Listing 3.10: Adjoint callback routine: Solution of the adjoint system. 
    [...] // continuation of previous listing

    // increment input adjoint for b
    for(int i = 0; i < a1_b.size(); i++){
        D->increment_input_adjoint( dco::value(a1_b[i]) );
    }

    // Addressing for upper and lower half of matrix
    const labelUList& uAddr = A_T.lduAddr().upperAddr();
    const labelUList& lAddr = A_T.lduAddr().lowerAddr();

    // increment adjoints for A (upper part)
    double inc = 0;
    for(int i = 0; i < A.upper().size(); i++){
        D->increment_input_adjoint( dco::value( -a1_b[lAddr[i]]*x[uAddr[i]] ) );
    }

    // increment adjoints for A (lower part)
    for(int i = 0; i < A.lower().size(); i++){
        D->increment_input_adjoint( dco::value( -a1_b[uAddr[i]]*x[lAddr[i]] ) );
    }

    // increment adjoints for A (diag part)
    for(int i = 0; i < nd; i++){
        D->increment_input_adjoint( dco::value(-a1_b[i]*x[i]) );
    }

    // treatment for parallel boundaries will go here, see later Sections
    [...]
}

Listing 3.11: Adjoint callback routine: Incrementation of input adjoints. 
3.5 Adjoints of Parallel Communication 


In this section, we will investigate how to efficiently adjoin iterative MPI parallel programs, 
particularly in the presence of linear solvers. First the general concepts of MPI are introduced, 
then the calculation of adjoints via AMPI is discussed. Building on those foundations the 
calculation of adjoints in the CFD context is explored. Lastly, it is shown how the SDLS approach 
presented in Section 3.4 can be adapted for (A)MPI parallel calculation. 





3.5.1 Message Passing Interface 


The Message Passing Interface (MPI) [Mes94] is the de-facto standard for implementing parallel 
C, C++ and Fortran codes on distributed memory machines. With MPI, each compute node runs 
one or multiple processes of the same executable, each calculating a subproblem of the serial 
problem (e.g. for CFD calculating the flow field on a subdomain of the original domain). All 
communication with other processes is wrapped into messages, which are passed from one process 
to another by the MPI libraries. A message is essentially a chunk of contiguous bytes 
of arbitrary length that is accompanied by some meta data (e.g. tags). One process cannot 
directly access the memory of another process, even if both share the same virtual memory 
space. All data which is not available in the local memory space has to be distributed using 
messages. Communication in MPI is either point-to-point or collective. Point-to-point messages 
originate from one process and are delivered to exactly one other process. An example for this is 
the MPI_Send and corresponding MPI_Recv construct for sending and receiving messages. The 
signatures for the MPI_Send and MPI_Recv routines are defined in the MPI standard as 





int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest, int 
    tag, MPI_Comm comm); 

and 

int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, 
    MPI_Comm comm, MPI_Status *status); 


respectively. The MPI_Send command sends a message of length count and type datatype to 
the process specified by dest. Unless the MPI implementation buffers the message, the call to 
MPI_Send will block until the receiving process has called MPI_Recv with the process ID of the 
sending process as its source argument. On the other process MPI_Recv will block until the 
sending process has called MPI_Send. Thus, blocking MPI communication inherently leads to a 
synchronization of the involved processes. However, it can also lead to deadlocks if the 
MPI_Send and MPI_Recv calls are not correctly paired. 

Collective communication is used when a message is required to reach multiple processes at once, 
or when data is to be reduced. Collective communications can be grouped into different cases: 


One to all communication: Data is sent from one process to all others. E.g. MPI_Bcast. 


All to one: Data is sent from all processes to one root process, potentially involving a reduction 
operation. E.g. MPI_Reduce. 


All to all: Data is sent from all processes to all other processes, potentially involving a reduction 
operation. E.g. MPI_Allreduce. 
Figure 3.18: Illustration of an AMPI communication pattern. A primal MPI_Send generates an 
adjoint MPI_Send in opposite direction. 

The specific communication patterns used to implement the collective communications are 
not specified by the standard and are implementation dependent (popular MPI implementations 
are Open MPI, Intel MPI, and MPICH). 


3.5.2 Adjoint MPI 


Adjoint MPI (AMPI) is a library developed at STCE [SN12], in cooperation with Argonne National 
Laboratory and the Institute for Research in Computer Science and Automation (Inria) [Utk+09]. 
It introduces the concept of adjoints to MPI. The challenge for adjoints in the context of 
parallel communication is the split between (augmented) primal evaluation and reverse adjoint 
propagation. Adjoint information needs to be propagated back from the outputs of the program 
to the inputs. If data is received in the augmented primal evaluation using regular MPI 
calls, the corresponding adjoints are not incremented on the remote process during the adjoint 
reverse propagation. Therefore, the adjoints on the remote process are incomplete. 

In order to correctly evaluate the adjoints in presence of MPI communication, the MPI calls in 
the augmented primal section have to be accompanied by corresponding MPI calls in the adjoint 
reverse section, which distribute the adjoints back to the relevant processes. A basic case for this 
is illustrated in Figure 3.18. Data is sent from one process ($P_0$) to another process ($P_1$) with a 
pair of blocking MPI_Send and MPI_Recv calls. In the reverse section the direction of the calls is 
switched, transferring adjoint information from $P_1$ to $P_0$. 

The AMPI library provides a set of wrapper functions for the regular MPI calls. In the 
augmented primal section, AMPI keeps track of the MPI calls and then passes the calls on to 
the MPI library implementation. To track the MPI calls, AMPI keeps a list of all executed calls 
including the required meta data, such as message source and destinations, tags, and message 
length. In order to execute the required MPI calls during the reverse interpretation phase, AMPI 
inserts callback objects (using the same mechanisms introduced in Section 3.4) into the dco/c++ 
tape, which are called when adjoint data needs to be distributed using MPI during the adjoint 
reverse propagation. 
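
The bookkeeping behind such a wrapper can be mimicked without MPI. In the following sketch two "processes" are plain structs in one address space, and toySendRecv is a hypothetical wrapper that performs the primal transfer and records the matching reverse transfer. None of these names belong to AMPI; the sketch only illustrates the direction reversal.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Toy model of a process: primal values and matching adjoints (no real MPI).
struct ToyProcess {
    std::vector<double> value;
    std::vector<double> adjoint;
};

struct ToyWorld {
    std::vector<ToyProcess> procs;
    std::vector<std::function<void(ToyWorld&)>> reverseCalls;  // recorded by the wrapper
};

// Wrapper for a matched send/receive pair. The primal transfer copies the
// value from src to dst; the recorded reverse call sends the adjoint from
// dst back to src, where it is accumulated.
void toySendRecv(ToyWorld& w, std::size_t src, std::size_t srcIdx,
                 std::size_t dst, std::size_t dstIdx) {
    w.procs[dst].value[dstIdx] = w.procs[src].value[srcIdx];
    w.reverseCalls.push_back([=](ToyWorld& world) {
        double& received = world.procs[dst].adjoint[dstIdx];
        world.procs[src].adjoint[srcIdx] += received;  // adjoint travels dst -> src
        received = 0.0;  // the primal receive overwrote this variable
    });
}

// Reverse sweep: replay the recorded communication in opposite order.
void reverseSweep(ToyWorld& w) {
    for (auto it = w.reverseCalls.rbegin(); it != w.reverseCalls.rend(); ++it) {
        (*it)(w);
    }
}
```

After a primal transfer from process 0 to process 1, an adjoint seeded on process 1 ends up on process 0 once reverseSweep() has run.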

For collective communications, the adjoint communication patterns are more complex. The 
adjoint of a variable $x$ needs to be incremented for every statement which uses $x$. 
Thus, if the local variable $x$ is sent to multiple processes (e.g. using MPI_Bcast) the adjoint $\bar{x}$ 
needs to (potentially) be incremented by adjoints sent back from all those processes. Hence, 
a primal call to MPI_Bcast is augmented by a call to MPI_Reduce with reduce operation 
MPI_SUM in the reverse interpretation, which sums up all partial adjoints of $x$ from the involved 
processes, and then increments the adjoint of the root process by this sum. This communication 
pattern is illustrated in Figure 3.19 for four processes and root process $P_0$. 

Figure 3.19: Primal (black) and adjoint (blue) communication for MPI_Bcast. The MPI_Bcast 
call in the primal section, distributing value $x$ from root $P_0$ to all other processes, is transformed 
into a MPI_Reduce call in the reverse section, summing up the partial adjoints on the root $P_0$: 
$\bar{x} \mathrel{+}= \bar{y}_1 + \bar{y}_2 + \bar{y}_3$. 
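
The broadcast and its adjoint can be written down for a toy setting in which the ranks are entries of a vector. This is only an illustration of the data flow; toyBroadcast and toyAdjointBroadcast are hypothetical helpers and no MPI call is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Primal: every rank receives a copy of the root value.
std::vector<double> toyBroadcast(double rootValue, std::size_t nRanks) {
    return std::vector<double>(nRanks, rootValue);
}

// Reverse: sum reduction of the partial adjoints onto the root.
// partialAdjoints[r] is the adjoint of the copy held by rank r.
double toyAdjointBroadcast(const std::vector<double>& partialAdjoints) {
    assert(!partialAdjoints.empty());
    return std::accumulate(partialAdjoints.begin(), partialAdjoints.end(), 0.0);
}
```

For four ranks with partial adjoints 0, 1, 2 and 3 (the root contributing nothing itself), the adjoint of the root value is incremented by 6.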

Table 3.5 lists the primal MPI calls and the corresponding calls during reverse propagation for 
some common MPI calls. It has been shown that all common one-sided MPI operations can be 
adjoined with a constant overhead, compared to the primal MPI call, with the exception of the 
MPI_Reduce operation with MPI_PROD as the reduction operator [Sch14]. The number of adjoint 
increments required for a product reduction scales linearly with the number of involved processes, 
due to the product rule. 
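
One way to see the linear scaling is to write out the product rule for a reduction over $p$ processes, each contributing a single value $x_i$ (the symbols $p$, $x_i$ and $y$ are introduced here for illustration only):

```latex
y = \prod_{i=0}^{p-1} x_i
\quad\Longrightarrow\quad
\bar{x}_i \mathrel{+}= \bar{y} \prod_{j \neq i} x_j ,
\qquad i = 0, \dots, p-1 .
```

Each of the $p$ inputs receives its own partial derivative, and each of these depends on the values held by all other processes. For a sum reduction, in contrast, every partial derivative is one and a single $\bar{y}$ serves all inputs.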

AMPI can be interfaced with different AD tools, requiring that these tools provide a set of 
interface functions. An interface for dco/c++ is included in the open-source AMPI release. 





3.5.3 Combining AMPI with OpenFOAM 


OpenFOAM implements distributed messaging in a hierarchical fashion, by wrapping low level 
communication routines in a separate layer. This is organized as follows. On the high level, information 
is stored in data structures specific to the CFD domain. For example, GeometricField<scalar, 
fvPatchField, volMesh> stores a field of scalar typed values. It includes references to the 
boundary conditions and the underlying volume mesh. For convenience, the definition is 
abbreviated to volScalarField via a typedef. The communication of (field) data between different 
processes is abstracted into a library called Pstream, which allows the implementation of different 
parallelization strategies. However, at the moment only MPI communication is implemented in 
the regular OpenFOAM release. 





Table 3.5: MPI routines used in OpenFOAM, their AMPI versions, and communication patterns 
specific to the augmented primal and adjoint sections of the adjoint code. 

Primal MPI call    AMPI call         MPI call in reverse section 
MPI_Bsend          AMPI_Bsend        MPI_Recv 
MPI_Recv           AMPI_Recv         MPI_Bsend 
MPI_Allreduce      AMPI_Allreduce    MPI_Allreduce 
MPI_Bcast          AMPI_Bcast        MPI_Reduce 
class AmpiTypeHelper{
private:
    const std::type_info& type;
    Foam::string caller;
    AmpiTypeHelper(const std::type_info& type, Foam::string caller)
        : type(type), caller(caller){}
    AmpiTypeHelper();
public:
    template<typename T>
    static const AmpiTypeHelper create(Foam::string caller=""){
        const AmpiTypeHelper t(typeid(T), caller);
        return t;
    }
    bool operator ==(const AmpiTypeHelper &b){
        return (b.type == this->type);
    }
    bool is_active_type() const {
        return type == typeid(dco::ga1s<double>::type)
            || type == typeid(Foam::Vector<dco::ga1s<double>::type>)
            || type == typeid(Foam::Tensor<dco::ga1s<double>::type>);
    }
};

Listing 3.12: Type helper, used to determine the type of data passed to the low level 
communication routines. For debug purposes, the caller function can also be passed along. 


To simplify the low level communication routines, the OpenFOAM development team decided 
not to pass down type information from the high level data structures to the low level communication 
routines. All message data is cast to the char data type (which occupies one byte in C) before 
being passed to the low level routines (more specifically, just the data pointer passed to the low level 
is cast to char*). This allows the use of the MPI_BYTE data type inside the MPI send and receive 
routines for all types of data, avoiding specializations of the low level communication routines for 
different data types. As OpenFOAM knows which data to expect on the high level, the type of 
the received data is not needed. However AMPI, which wraps around the MPI routines on the 
low level, needs to be able to separate floating point data from passive data (e.g. integers, strings). 
For passive data, the primal communication routines can be kept unchanged, as no derivative 
information is associated with them. For (active) floating point data, the communication routines 
have to be changed to accommodate the reverse propagation of adjoints. 








To incorporate the type information into OpenFOAM, instead of rewriting the whole 
communication layer, a type helper is introduced. It is inserted into all calls to the low level MPI read and 
write routines. The code for the AmpiTypeHelper is shown in Listing 3.12. The type helper is 
constructed from an arbitrary variable and stores its std::type_info information. This allows 
the type of the variable to be compared with another type. It offers a routine which returns 
true if the corresponding type is or contains a dco/c++ active type. With that information the low 
level communication routines can be augmented to decide whether to issue an MPI or an AMPI call. 
An AMPI call only needs to be issued if active types are involved and the augmented primal is 
executed (i.e. the tape is active) while the call to MPI is executed. 
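
The decision logic described above can be sketched independently of OpenFOAM and dco/c++. In the following, ActiveDouble is a hypothetical placeholder for an active type, and TypeTag, Route and chooseRoute are illustrative names that do not appear in the actual code base.

```cpp
#include <cassert>
#include <typeinfo>

// Hypothetical placeholder for an AD active type.
struct ActiveDouble {
    double value;
};

// Carries run-time type information down to a type-agnostic low level routine.
class TypeTag {
    const std::type_info* type_;
    explicit TypeTag(const std::type_info& t) : type_(&t) {}

public:
    template <typename T>
    static TypeTag create() {
        return TypeTag(typeid(T));
    }
    bool isActiveType() const { return *type_ == typeid(ActiveDouble); }
};

enum class Route { PlainMpi, Ampi };

// The adjoint-aware path is only needed for active data while the tape records.
Route chooseRoute(const TypeTag& tag, bool tapeIsActive) {
    return (tag.isActiveType() && tapeIsActive) ? Route::Ampi : Route::PlainMpi;
}
```

A caller on the high level would create the tag from the element type it is about to send, for example TypeTag::create<ActiveDouble>(), and pass it along with the untyped byte pointer.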


The procedure outlining the (adjoint) communication flow for standard OpenFOAM and 
discrete adjoint OpenFOAM using AMPI is illustrated in Figures 3.20 and 3.21. 

Figure 3.20: Unaugmented data flow from high level classes to communication layer. 

Figure 3.21: Type augmented data flow from high level classes to communication layer. Type 
information allows AMPI to determine whether adjoint communication patterns need to be 
applied. 


To give an understanding of the amount of (adjoint) communication during a typical solver 
execution, Table 3.6 shows the total number of (A)MPI calls for 10 iteration steps of the 
adjointSimpleFoam solver on the Pitz-Daily case. The domain is decomposed onto 2, 4, and 8 
processors. Listed are MPI calls, that is calls to the standard MPI routines without involvement 
of AMPI; AMPI calls, that is calls to MPI routines from inside the AMPI wrapper routines 
during the augmented primal evaluation; and AMPI_MPI calls, that is calls to MPI routines 
issued by AMPI during the reverse propagation sweep. 





OpenFOAM provides its own implementations for collective operations (instead of relying on 
MPI_Gather and MPI_Scatter), which are implemented using the basic MPI_Send and MPI_Recv 
routines. Therefore, the only MPI operations present in significant number are blocking MPI_Bsend 
and MPI_Recv pairs, and MPI_Allreduce (with the MPI_SUM reduction operator). MPI_Allreduce 
is mainly used to calculate the residual of the linear solver iterations, while the send and receive 
calls are used to distribute data needed for the matrix vector products. 

Table 3.6 shows that if the differentiation of the linear solvers is performed 
symbolically, only very few AMPI calls remain (only around 2% of all calls to MPI need to be 
passed through AMPI). For SDLS, the communication of data needed for the matrix vector 
products in the iterative linear solvers moves from AMPI calls to MPI calls. This significantly 
reduces the number of calls to MPI routines during the reverse propagation phase and also lowers 
the number of adjoint callback objects in the dco/c++ tape, reducing run time and memory. 

Table 3.6: Calls to different (A)MPI routines for 10 steps of adjointSimpleFoam on the 2D 
Pitz-Daily case. Tabulated are decompositions to 2, 4 and 8 processors, with and without 
SDLS. 

                                  no SDLS                          SDLS 
Processors                    2         4         8          2         4         8 
MPI_Bsend                     0         0         0      45738    145578    517770 
MPI_Recv                      0         0         0      45738    145578    517770 
MPI_Allreduce                10        20        40      40414    108592    333896 
AMPI_Bsend                23758     74946    265298        640      1920      7040 
AMPI_Recv                 23758     74946    265298        640      1920      7040 
AMPI_Allreduce            20718     55116    168112        166       332       664 
AMPI_Waitall                 20        36        56         20        36        56 
AMPI_MPI_Bsend            23758     74946    265298        640      1920      7040 
AMPI_MPI_Recv             23758     74946    265298        640      1920      7040 
AMPI_MPI_Allreduce        20716     55112    168104        164       328       656 
Total MPI calls             352      1010      2358     132232    400738   1371754 
Total AMPI calls         136506    410120   1397716       2930      8448     29788 
Total AMPI_MPI calls      68232    205004    698700       1444      4168     14736 
Total sum                205090    616134   2098774     136606    413354   1416278 

The inclusion of SDLS comes at the cost of additional linear solver calls during the reverse 
propagation phase, which in turn issue additional MPI calls. Therefore, the number of MPI_Bsend 
and MPI_Recv calls for the SDLS case is higher than the corresponding AMPI_Bsend and AMPI_Recv 
calls for the non-SDLS case. However, for the studied case the increase is lower than a factor of 
two, meaning that the SDLS variant issues fewer MPI calls overall than the AD implementation, 
which repeats every AMPI call during reverse propagation. As the number of iterations during 
primal and reverse linear equation solves is not necessarily equal, the opposite could also be true 
for a different case. 

The additional steps required to incorporate SDLS in an AMPI setting are discussed in detail 
in Section 3.5.5. 








3.5.4 Tangent MPI Communication 


The tangent data type of dco/c++ can be used with MPI with only minor modifications. The 
memory layout of the (scalar or vector) tangent type always bundles together a value with its 
corresponding tangent(s). This contiguous memory layout allows the existing MPI send 
and receive routines to be reused with correspondingly increased message size. The message size 
doubles for tangent scalar mode and increases by a factor of (1 + n) for tangent vector mode, where 
n is the number of tangent components. 

When using the built-in MPI reductions of MPI_Allreduce, additional logic is required for some 
reductions. The reductions assume that each entry of the sent data stream contains an 
element for which associativity and commutativity with other elements holds. For a vector of 
tangent data types, the primal values and tangents are stored interleaved; MPI, however, interprets 
the vector as a (bigger) vector of primals. 

Figure 3.22: Reduction of tangent data using MPI_SUM. The MPI reduction treats the two 
tangent types as four floating point values and sums them up individually. 

Inspection of the OpenFOAM source code reveals that only two of the built-in reductions 
are used, namely MPI_SUM for summation and MPI_MIN for finding the minimum value across a 
vector. For MPI_SUM no special actions have to be taken, because the tangent of a sum of two 
tangent types is the sum of its tangents. The built-in sum reduction will thus calculate the correct 
primals and tangents by default. Note that this only holds because the MPI_SUM reduction sums 
up vectors element-wise without changing their dimension, and thus primals and tangents are 
summed up without mixing. This reduction is illustrated in Figure 3.22. 

For all remaining reduce operations (of which only MPI_MIN is currently used in the OpenFOAM 
code base), we enforce the use of a manual reduction implementation, which separates the 
communication of data and the reduction operation. As this manual reduction calls functions 
that are covered by the AD tool, instead of the MPI intrinsic reduce operations, the correct 
tangents will be calculated. 
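
The difference between the two reductions can be made concrete with a two-element example. The struct Tangent below is a hypothetical stand-in for a scalar tangent type, and flatSum and flatMin imitate what a built-in reduction does when it treats each value/tangent pair as two independent floating point numbers.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for a scalar tangent type: value and tangent side by side.
struct Tangent {
    double value;
    double tangent;
};

// Built-in style sum: both slots are added independently, which is correct,
// because the tangent of a sum is the sum of the tangents.
Tangent flatSum(const Tangent& a, const Tangent& b) {
    return {a.value + b.value, a.tangent + b.tangent};
}

// Built-in style minimum: both slots are minimized independently, so the
// resulting tangent may belong to a different element than the value.
Tangent flatMin(const Tangent& a, const Tangent& b) {
    return {std::min(a.value, b.value), std::min(a.tangent, b.tangent)};
}

// Manual reduction: select the whole element by its primal value.
Tangent manualMin(const Tangent& a, const Tangent& b) {
    return (b.value < a.value) ? b : a;
}
```

For a = (1, 5) and b = (2, -3), flatMin returns (1, -3), pairing the value of a with the tangent of b, whereas manualMin returns (1, 5).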


3.5.5 Parallel Symbolic Differentiation of Linear Solvers with AMPI 


The symbolic differentiation of (sparse) linear solvers becomes more complex when parallel 
communication is involved. For distributed CFD calculations, commonly a ghost cell approach is 
used, where values of remote domains are cached locally and updated only on demand to reduce 
communication complexity. In a black-box differentiation setting, the update of the boundaries 
will be passed through, and handled by, AMPI. However, if such information is passed during a 
passive section, e.g. during the symbolically differentiated linear solver calls, the flow of adjoints 
is interrupted. The corresponding adjoint information has to be supplied in a callback function, 
that is called during the reverse interpretation sweep, as previously discussed in Section 3.4. 





In order to correctly adjoin the boundary communication, it is important to first understand how 
processor boundaries are treated. The geometric domain decomposition utilized in OpenFOAM 
leads to both decomposed matrix and vector entries. Each processor holds a subset of the solution 
vector and a subset of coefficients of the discretization matrices. The entries of the solution vector 
correspond to values defined on cells/faces located in the decomposed domain. 





The global coefficient matrix is decomposed row wise to the individual processors. The matrix 
coefficients stored in the local LDU matrix description correspond to a block diagonal sub-matrix 
in the global discretization matrix and can be multiplied with the contents of the solution vector 
without inter-processor communication. Matrix entries which lie outside of the diagonal block 
require multiplication with vector entries that are not located on the same processor. The 
corresponding vector entries need to be communicated to the local processor, in order to be 
added to the matrix vector product. Those matrix entries are stored separately from the LDU 
description and are called matrix interfaces. The corresponding entries of the solution vector are 
called interface values. 

const label nCells = diag().size();
const label nFaces = upper().size();

// diagonal (cell) coefficients
for (label cell=0; cell<nCells; cell++){
    ApsiPtr[cell] = diagPtr[cell]*psiPtr[cell];
}

// off diagonal (face) coefficients
for (label face=0; face<nFaces; face++){
    ApsiPtr[uPtr[face]] += lowerPtr[face]*psiPtr[lPtr[face]];
    ApsiPtr[lPtr[face]] += upperPtr[face]*psiPtr[uPtr[face]];
}

Listing 3.13: Calculation of the local matrix vector product between the LDU coefficients and 
vector psi. 

OpenFOAM separates the matrix vector product into two parts. In the first part the product 
between all local matrix coefficients and local vector entries is calculated. This part involves no 
MPI communication and therefore there is no need to supply additional AMPI information. 

The second part involves the multiplication of the matrix interface coefficients with the interface 
values. For each row of the coefficient matrix, the values of the interface coefficients need to be 
brought in from remote processes to the local process, calculating the part of the matrix vector 
product corresponding to the row. 

To hide communication latency, the implementation separates the parallel matrix vector 
multiplication into three stages: 





Preparation of matrix interfaces: Send matrix interface coefficients B to the corresponding 
processors. For non-blocking communication, this allows the next step to start before the 
interfaces are received. 

Multiplication of local coefficients: Calculate the local matrix vector product according to 
Listing 3.13. Due to the LDU addressing, memory access is indirect for both the result and the 
entries of the multiplicand vector. (Cache friendly if the mesh is numbered efficiently.) 

Multiplication of interface coefficients: At a later stage, add the missing product terms to the result 
according to Listing 3.14. Entries of the interfaces are already matched correctly, therefore 
indirect memory lookup only occurs for the result vector. 
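
The split into a local part and an interface part can be tried out on a one-dimensional mesh of four cells, cut into two subdomains of two cells each. The sketch below is self-contained and uses hypothetical names (LocalMatrix, parallelProduct); the interface values vals are passed in directly, standing in for the data that would arrive via MPI.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Processor-local part of a distributed matrix (hypothetical stand-in).
struct LocalMatrix {
    std::vector<double> diag, lower, upper;         // local LDU coefficients
    std::vector<std::size_t> lowerAddr, upperAddr;  // LDU addressing
    std::vector<std::size_t> faceCells;             // local cell at each interface face
    std::vector<double> interfaceCoeffs;            // matrix interface coefficients
};

// Local product followed by the interface contributions. vals holds the
// interface values, i.e. the remote entries of psi, already aligned with
// interfaceCoeffs.
std::vector<double> parallelProduct(const LocalMatrix& A,
                                    const std::vector<double>& psi,
                                    const std::vector<double>& vals) {
    assert(psi.size() == A.diag.size());
    assert(vals.size() == A.interfaceCoeffs.size());
    std::vector<double> result(psi.size());
    // Local coefficients: indirect access on both sides.
    for (std::size_t c = 0; c < psi.size(); ++c) {
        result[c] = A.diag[c] * psi[c];
    }
    for (std::size_t f = 0; f < A.lower.size(); ++f) {
        result[A.upperAddr[f]] += A.lower[f] * psi[A.lowerAddr[f]];
        result[A.lowerAddr[f]] += A.upper[f] * psi[A.upperAddr[f]];
    }
    // Interface coefficients: indirect access only for the result vector.
    for (std::size_t e = 0; e < A.faceCells.size(); ++e) {
        result[A.faceCells[e]] += A.interfaceCoeffs[e] * vals[e];
    }
    return result;
}
```

For the global tridiagonal matrix with stencil (-1, 2, -1) and x = (1, 2, 3, 5), the two subdomains together reproduce the global product (0, 0, -1, 7), provided each receives the neighbouring cell value as its interface value.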
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const labelUList& faceCells = this->interface().faceCells(); 


forAll(faceCells, elemI){ 
result[faceCells[elemI]] += coeffslLelemI]*vals[elemI]; 
} 


Listing 3.14: Calculation of the matrix vector product between processor interface coefficients 
coeffs with entries of the remote vector psi, stored in vals. Entries of coeffs and vals are 
already correctly aligned, therefore no indirect access on the right hand side. Implementation 
in lduInterfaceFieldTemplates.C 
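
The separation into a local stage and an interface stage can be sketched outside of OpenFOAM as follows. This is a minimal single-process illustration, not the OpenFOAM implementation: the types LduMatrix and Interface and the functions localProduct and addInterfaceProduct are hypothetical names chosen for this sketch, and the MPI transfer is assumed to have already filled Interface::vals.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins for OpenFOAM's LDU data (not the real classes).
struct LduMatrix {
    std::vector<double> diag;        // d, one entry per local cell
    std::vector<double> lower;       // l, one entry per internal face
    std::vector<double> upper;       // u, one entry per internal face
    std::vector<std::size_t> lAddr;  // L: lower cell index of each face
    std::vector<std::size_t> uAddr;  // U: upper cell index of each face
};

// One processor interface: a coefficient per boundary face, the local cell each
// face belongs to, and the remote entries of x (assumed to be received already).
struct Interface {
    std::vector<double> coeffs;
    std::vector<std::size_t> faceCells;
    std::vector<double> vals;  // aligned with coeffs, so no indirect access
};

// Local stage: uses only local coefficients and local vector entries.
std::vector<double> localProduct(const LduMatrix& A, const std::vector<double>& x) {
    assert(x.size() == A.diag.size());
    std::vector<double> result(A.diag.size());
    for (std::size_t c = 0; c < A.diag.size(); ++c) {
        result[c] = A.diag[c] * x[c];
    }
    for (std::size_t f = 0; f < A.lower.size(); ++f) {
        result[A.uAddr[f]] += A.lower[f] * x[A.lAddr[f]];
        result[A.lAddr[f]] += A.upper[f] * x[A.uAddr[f]];
    }
    return result;
}

// Interface stage: add the missing product terms; only the result is indexed indirectly.
void addInterfaceProduct(const Interface& itf, std::vector<double>& result) {
    assert(itf.coeffs.size() == itf.vals.size());
    for (std::size_t e = 0; e < itf.faceCells.size(); ++e) {
        result[itf.faceCells[e]] += itf.coeffs[e] * itf.vals[e];
    }
}
```

For a tridiagonal matrix with diagonal 4 and off-diagonals -1 on four cells, split after the second cell, the first process holds the diagonal {4, 4}, one internal face, and one interface face with coefficient -1 whose remote value is the third entry of x.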
Figure 3.23: Structured 4 x 4 mesh, decomposed onto four processors $P_0$ to $P_3$. Cells are numbered 
such that the distance between local cells is minimal. To compress notation, cells are numbered in 
hexadecimal. 





An example for the mesh decomposition, the LDU decomposition, and the global and local 
coefficient matrices is given in Figure 3.24. It illustrates the structured 4 x 4 grid depicted in 
Figure 3.23, decomposed onto four processors. The global cells are numbered such that the 
maximum (index) distance between two cell indices on the individual processors is minimal. To 
obtain a more compact notation in the global coefficient matrix, we number the 16 cells in 
hexadecimal from $c_0$ to $c_F$. There are processor boundaries between the processor pairs $(P_0, P_1)$, 
$(P_0, P_2)$, $(P_1, P_3)$, and $(P_2, P_3)$. The vectors l, d, u in the figure list the processor local entries. 
The index vectors L and U give the addressing for l and u. 





For each processor boundary, two entries of the solution vector x have to be copied from the 
remote process. For example, $v^{01} = [x_1, x_3]$ is copied from $P_0$ to $P_1$ and $v^{10} = [x_4, x_6]$ in the 
opposite direction. 

For convenience, the numbering in the figure corresponds to global cell numbers. In the 
implementation, each processor numbers its local cells from zero. Coefficients outside of the diagonal 
blocks (that is, coefficients of faces on a processor boundary) require additional communication: 
entries of the distributed vector must be brought to the processor that computes the corresponding 
rows of the matrix vector product. For example, $u_{14}$ and $u_{36}$ on the processor boundary $(P_0, P_1)$ 
require the product between coefficients associated with processor $P_0$ and the 
parts of the solution vector located on $P_1$, i.e. $x_4$ and $x_6$. 
[Figure 3.24, matrix graphic. Recoverable rows of processor $P_0$: row 0: $d_{00}, u_{01}, u_{02}$; row 1: $l_{10}, d_{11}, u_{13}, u_{14}$; row 2: $l_{20}, d_{22}, u_{23}, u_{28}$; row 3: $l_{31}, l_{32}, d_{33}, u_{36}, u_{39}$; multiplied with $x_0, \dots, x_3$.] 

Figure 3.24: Block diagonal matrix, discretizing the mesh from Figure 3.23. Colored coefficients 
on the main blocks are local to the processors and can be multiplied with the corresponding 
entries of the vector x without further communication. Entries in off-diagonal blocks need to 
be communicated to the corresponding processors to correctly evaluate the matrix vector product. 


The SDLS implementation presented in Section 3.4 calculates the adjoints of the local LDU 
coefficients and vector entries only. To correctly capture the adjoints of the parallel 
matrix vector product, we also need to symbolically calculate the adjoints of the matrix interface 
coefficients and the interface values. The calculation of the adjoints is outlined in Algorithm 2. 
Lines 1-20 calculate the adjoints of the local coefficients as outlined in Section 3.4. For symmetric 
matrices, the adjoints of the off-diagonal coefficients also need to be incremented by the adjoints 
of the omitted opposite part (Lines 10 and 18). 

Lines 21-28 list the calculation of the outer product $-\bar{b} \cdot x^T$ for the interface coefficients. To 
calculate the adjoints, the interface values of the remote processors need to be brought 
into the local scope (Line 23). Using the remote value of $x$ and the local value of $\bar{b}$, the individual 
values of the outer product can be formed. Lines 21-27 directly correspond to Lines 37-55 of the 
implementation shown in Appendix B.3. 
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Differentiation of Complex Iterative CFD Algorithms 


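Input: primal solution $x$, incoming adjoints $\bar{x}$ 
Data: matrix coefficients $(l, d, u)$, LDU addressing $(L, U)$, boundary coefficients $B$ 
Output: adjoints $\bar{b}$, $(\bar{l}, \bar{d}, \bar{u})$, $\bar{B}$ 

 1  $\bar{b} \leftarrow \bar{b} + \mathrm{solve}(A^T, \bar{x})$ 
 2  forall diagonal entries $d_i$ at index $i$ in $d$ do 
 3      $\bar{d}_i \leftarrow \bar{d}_i - \bar{b}_i \cdot x_i$ 
 4  end 
 5  forall lower entries $l_i$ at index $i$ in $l$ do 
 6      $j \leftarrow U_i$ 
 7      $k \leftarrow L_i$ 
 8      $\bar{l}_i \leftarrow \bar{l}_i - \bar{b}_j \cdot x_k$ 
 9      if A symmetric with no upper part then 
10          $\bar{l}_i \leftarrow \bar{l}_i - \bar{b}_k \cdot x_j$ 
11      end 
12  end 
13  forall upper entries $u_i$ at index $i$ in $u$ do 
14      $j \leftarrow L_i$ 
15      $k \leftarrow U_i$ 
16      $\bar{u}_i \leftarrow \bar{u}_i - \bar{b}_j \cdot x_k$ 
17      if A symmetric with no lower part then 
18          $\bar{u}_i \leftarrow \bar{u}_i - \bar{b}_k \cdot x_j$ 
19      end 
20  end 
21  forall processor boundary fields $p_j$ with index $j$ do 
22      update boundary cells of $x$ and $\bar{b}$ on patch $p_j$ 
23      $x^* \leftarrow$ boundary cell values of $x$ from neighboring processor 
24      $\bar{b}^* \leftarrow$ boundary cell values of $\bar{b}$ from this processor 
25      forall faces $f_i$ with index $i$ on $p_j$ do 
26          $\bar{B}_{j,i} \leftarrow \bar{B}_{j,i} - \bar{b}^*_i \cdot x^*_i$ 
27      end 
28  end 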
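
The local part of Algorithm 2 (the adjoints of the diagonal, lower, and upper coefficients for a non-symmetric matrix) can be sketched in plain C++ as below. This is an illustrative sketch under assumptions: the struct LduAdjoints and the function lduCoefficientAdjoints are hypothetical names, the adjoint right-hand side bBar is assumed to have been obtained from the transposed solve beforehand, and the symmetric special cases and the processor interfaces are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adjoints of the LDU coefficients (hypothetical container for this sketch).
struct LduAdjoints {
    std::vector<double> diag, lower, upper;
};

// Restrict the outer product -bBar * x^T to the LDU sparsity pattern.
// x:    primal solution of A x = b
// bBar: adjoint of b, i.e. the solution of A^T bBar = xBar (computed elsewhere)
LduAdjoints lduCoefficientAdjoints(const std::vector<double>& x,
                                   const std::vector<double>& bBar,
                                   const std::vector<std::size_t>& lAddr,
                                   const std::vector<std::size_t>& uAddr) {
    assert(x.size() == bBar.size() && lAddr.size() == uAddr.size());
    LduAdjoints a;
    a.diag.assign(x.size(), 0.0);
    a.lower.assign(lAddr.size(), 0.0);
    a.upper.assign(lAddr.size(), 0.0);

    for (std::size_t i = 0; i < x.size(); ++i) {
        a.diag[i] -= bBar[i] * x[i];
    }
    for (std::size_t f = 0; f < lAddr.size(); ++f) {
        // The lower coefficient of face f sits in row uAddr[f], column lAddr[f].
        a.lower[f] -= bBar[uAddr[f]] * x[lAddr[f]];
        // The upper coefficient of face f sits in row lAddr[f], column uAddr[f].
        a.upper[f] -= bBar[lAddr[f]] * x[uAddr[f]];
    }
    return a;
}
```

Only the entries that exist in the sparsity pattern are formed, so the dense outer product is never stored.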
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3.6 Higher Order Differentiation 


Obtaining derivatives of order higher than one might be desired for a multitude of reasons. 
Second order derivatives can e.g. be used to implement a Newton optimization scheme (see 
Section 2.9.2). Further an additional optimization step, operating on the optimized parameters, 
might be desired. This motivates the use of higher derivatives. As already outlined in Section 2.7.2, 
higher derivatives can be obtained by repeatedly applying the first order model of AD. 

The AD tool dco/c++ allows the nesting of first order data types to obtain derivatives of 
arbitrary order. Nesting a tangent data type inside an adjoint type yields the tangent over 
adjoint model: 

typedef dco::ga1s<dco::gt1s<double>::type>::type doubleScalar; 

Analogously, a tangent data type nested inside another tangent data type yields the tangent over 
tangent model: 

typedef dco::gt1s<dco::gt1s<double>::type>::type doubleScalar; 


The nesting of data types to obtain second order derivatives, as well as the access routines of 
dco/c++ required to extract the individual derivative components, are shown in Figure 3.25. 

If implemented naively, the calculation of a Hessian $H \in \mathbb{R}^{n \times n}$ requires n evaluations of the 
tangent over adjoint model of AD and consequently n full evaluations of the flow equations. For 
cases like topology optimization, the influence of a parameter on the final objective propagates 
iteratively through the whole flow domain in a non-linear fashion. This propagation leads to a 
dense Hessian. Thus, coloring techniques [GMP05], which can reduce the number of AD model 
evaluations required for sparse matrices, are not applicable in this case. This 
makes the computation, and also the storage, of Hessians expensive, such that approximate 
quasi-Newton methods like BFGS are a better fit to speed up optimization convergence. 

However, for lower dimensional parameter spaces (see e.g. parametric optimization, Section 4.6), 
problems which are known to exhibit sparsity, or applications where higher order derivatives 
are needed, higher order adjoints are the superior choice over FD. An example driver for the 
evaluation of the full Hessian using scalar tangent over tangent mode is given in Listing 3.15. 
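
The idea of nesting one tangent type inside another can be illustrated without dco/c++. The following sketch uses a hand-written dual number template; Dual, D2, exampleF, and hessian are hypothetical names for this illustration and are not part of dco/c++ or discrete adjoint OpenFOAM. Only addition and multiplication are overloaded, which is enough for the polynomial test function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal forward-mode (tangent) number. T may itself be a Dual.
template <typename T>
struct Dual {
    T val;  // value component
    T der;  // tangent (directional derivative) component
};

template <typename T>
Dual<T> operator+(const Dual<T>& a, const Dual<T>& b) {
    return {a.val + b.val, a.der + b.der};
}

template <typename T>
Dual<T> operator*(const Dual<T>& a, const Dual<T>& b) {
    return {a.val * b.val, a.der * b.val + a.val * b.der};  // product rule
}

using D2 = Dual<Dual<double>>;  // tangent over tangent

// Example function f(x) = x0^2 * x1 + x1^2 (chosen for this sketch only).
D2 exampleF(const std::vector<D2>& x) {
    return x[0] * x[0] * x[1] + x[1] * x[1];
}

// Scalar tangent over tangent: one evaluation of f per pair (i, j) with j <= i.
template <typename F>
std::vector<std::vector<double>> hessian(F f, const std::vector<double>& x0) {
    const std::size_t n = x0.size();
    assert(n > 0);
    std::vector<D2> x(n);  // value-initialized, all components zero
    for (std::size_t k = 0; k < n; ++k) x[k].val.val = x0[k];

    std::vector<std::vector<double>> H(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        x[i].der.val = 1.0;  // seed outer tangent direction e_i
        for (std::size_t j = 0; j <= i; ++j) {
            x[j].val.der = 1.0;  // seed inner tangent direction e_j
            const D2 y = f(x);
            H[i][j] = H[j][i] = y.der.der;  // second order component
            x[j].val.der = 0.0;
        }
        x[i].der.val = 0.0;
    }
    return H;
}
```

At the point (3, 5) the analytic Hessian of the example function is [[10, 6], [6, 2]], which the driver reproduces with three evaluations of f.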

Improving on the scalar tangent over tangent model, a whole block of the Hessian can be 
extracted at once using the tangent vector mode (see Section 2.8). For a vector size d, this 
mode extracts a sub block $S \in \mathbb{R}^{d \times d}$ of the full Hessian H with one evaluation of the 
augmented primal function. Listing 3.16 shows the seeding procedure to retrieve the Hessian of 
an arbitrary function $f : \mathbb{R}^n \to \mathbb{R}$ under the assumption that the vector size d is identical to n. 
Because the memory consumption of two nested vector types corresponds to $O(d^2)$, this is not 
feasible for even modestly sized problems. Hence, in Listing 3.17 we choose d < n and extract the 
smaller Hessian sub blocks S one by one. Calculating the Hessian in blocks also makes it possible to 
exploit the symmetry of the Hessian. 
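
How many evaluations of the augmented primal the block-wise extraction needs can be made concrete by enumerating the sub blocks. The sketch below only enumerates index ranges and does not touch any AD type; Block and lowerTriangularBlocks are hypothetical names introduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Half-open index ranges [rowBegin, rowEnd) x [colBegin, colEnd) of one sub block.
struct Block {
    int rowBegin, rowEnd, colBegin, colEnd;
};

// Enumerate the d x d sub blocks on and below the block diagonal of an n x n
// Hessian. Blocks above the diagonal follow from symmetry and are skipped.
std::vector<Block> lowerTriangularBlocks(int n, int d) {
    assert(n > 0 && d > 0);
    std::vector<Block> blocks;
    for (int i = 0; i < n; i += d) {
        for (int j = 0; j <= i; j += d) {
            blocks.push_back({i, std::min(i + d, n), j, std::min(j + d, n)});
        }
    }
    return blocks;
}
```

Each block corresponds to one evaluation of f in vector tangent over vector tangent mode. For n = 10 and d = 4 this gives 6 evaluations instead of the 9 needed without symmetry; the last block is truncated to the rows and columns 8 and 9.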

Using the tangent over adjoint mode of AD, the computational complexity of calculating the 
whole Hessian is lowered from $O(n^2)$ to $O(n)$, assuming that m is one or a constant with $m \ll n$. 
The tape has to be re-recorded for each tangent direction; each tape is therefore interpreted only 
once. Each evaluation of the tangent over adjoint model yields one row/column of the Hessian. 
The seeding of the adjoint and tangent directions for a general function f is shown in Listing 3.18. 

Both tangent over adjoint and tangent over tangent models are implemented in discrete adjoint 
Figure 3.25: Interface calls for accessing specific components of the higher order models. 
dco::derivative() and dco::value() are abbreviated as der() and val(). 


typedef dco::gt1s<dco::gt1s<double>::type>::type ADtype; 
extern ADtype f(ADtype* x, int n); 

void t2s_t1s(int n, ADtype* x, double** H){ 
    // seed all pairs of directions (i,j) with j <= i, one evaluation of f per pair 
    for(int i=0; i<n; i++){ 
        dco::value(dco::derivative(x[i])) = 1.0; 
        for(int j=0; j<=i; j++){ 
            dco::derivative(dco::value(x[j])) = 1.0; 
            ADtype y = f(x,n); 
            H[i][j] = H[j][i] = dco::derivative(dco::derivative(y)); 
            dco::derivative(dco::value(x[j])) = 0.0; 
        } 
        dco::value(dco::derivative(x[i])) = 0.0; 
    } 
} 

Listing 3.15: Dense seeding for obtaining a full Hessian of a function $f : \mathbb{R}^n \to \mathbb{R}$ in scalar tangent 
over tangent mode. 
typedef dco::gt1v<dco::gt1v<double,d>::type,d>::type ADtype; 
extern ADtype f(ADtype* x, int n); 

void t2v_t1v(int n, ADtype* x, double** H){ 
    // seed all directions, evaluate f once 
    for(int i=0; i<n; i++){ 
        dco::value(dco::derivative(x[i])[i]) = 1.0; 
        for(int j=0; j<n; j++) 
            dco::derivative(dco::value(x[j]))[j] = 1.0; 
    } 
    ADtype y = f(x,n); 
    for(int i=0; i<n; i++) 
        for(int j=0; j<n; j++) 
            H[i][j] = dco::derivative(dco::derivative(y)[i])[j]; 
} 

Listing 3.16: Seeding of vector tangent over vector tangent mode, under the assumption that 
the vector size d of gt1v::type equals the problem size n. 


typedef dco::gt1v<dco::gt1v<double,d>::type,d>::type ADtype; 
extern ADtype f(ADtype* x, int n); 

void t2v_t1v(ADtype* x, int n, int d, double** H){ 
    for(int i=0; i<n; i+=d){ 
        for(int j=0; j<=i; j+=d){ // use symmetry 
            for(int k=i; k<std::min(i+d,n); k++){ 
                dco::value(dco::derivative(x[k])[k%d]) = 1.0; 
                for(int l=j; l<std::min(j+d,n); l++){ 
                    dco::derivative(dco::value(x[l]))[l%d] = 1.0; 
                } 
            } 
            ADtype y = f(x,n); 
            for(int k=i; k<std::min(i+d,n); k++){ 
                dco::value(dco::derivative(x[k])[k%d]) = 0.0; 
                for(int l=j; l<std::min(j+d,n); l++){ 
                    H[k][l] = H[l][k] = dco::derivative(dco::derivative(y)[k%d])[l%d]; 
                    dco::derivative(dco::value(x[l]))[l%d] = 0.0; 
                } 
            } 
        } 
    } 
} 

Listing 3.17: Seeding of vector tangent over vector tangent mode for arbitrary matrix size n 
and tangent vector size d. 
typedef dco::ga1s<dco::gt1s<double>::type>::type ADtype; 
extern ADtype f(ADtype* x, int n); 

void t2s_a1s(int n, ADtype* x, double** H){ 
    // seed all directions, evaluate f n times 
    for(int i=0; i<n; i++){ 
        dco::derivative(dco::value(x[i])) = 1.0;   // tangent direction e_i 
        ADtype y = f(x,n); 
        dco::value(dco::derivative(y)) = 1.0;      // seed the adjoint of the output 
        ADmode::global_tape->interpret_adjoint(); 
        for(int j=0; j<n; j++){ 
            // tangent component of the adjoint of x[j] yields row i of the Hessian 
            H[i][j] = dco::derivative(dco::derivative(x[j])); 
        } 
        dco::derivative(dco::value(x[i])) = 0.0; 
    } 
} 

Listing 3.18: Seeding of scalar tangent over scalar adjoint mode. 


OpenFOAM. They did not require any changes in the code base, compared to the first order 
models (except the handling of checkpoints, if checkpointing is required). This leads us to believe 
that the computation of derivatives of order three and higher can also readily be implemented, by 
introducing the relevant nested data types. The required configuration options to enable second 
order derivatives in discrete adjoint OpenFOAM are listed in Appendix A.2. 

In our observations, using second order AD to compute Hessians to speed up the convergence 
of optimization methods proved to be ineffective, due to the run time overhead introduced by the 
additional evaluations of the augmented primal. The potentially lower number of optimization 
steps does not outweigh the increase in run time per optimization step. Furthermore, the 
convergence path of the topology optimizer to a local minimum proved to be quite noisy, leading 
to poor convergence of second order optimization methods, as they tend to choose small gradient 
step sizes. 

The run time factors for the higher order variants of the simpleFoam solver are listed in Table 3.7. 
Only the run time for a single evaluation of the augmented primal is listed. To obtain a full 
Hessian, the models must be evaluated repeatedly. 





Higher order sensitivities can be beneficial for parametric optimizations with limited number 
of parameters, as the run time overhead directly scales with the number of parameters for the 





Table 3.7: Run times for a single derivative evaluation in first and second order tangent mode. 
The run time factor for tangent vector mode (vector size 16) is given per gradient entry. 

Solver              Run time (s)   Factor 
simpleFoam                  7.74     1 
t1sSimpleFoam              28.73     3.71 
t1vSimpleFoam             451.47     3.65 
t2st1sSimpleFoam          100.29    12.96 
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tangent over adjoint model. In the parametric optimization procedure, introduced in Section 4.6, 
approximately constructed Hessians are used. In the future those approximations could be readily 
replaced by exact derivatives. 


3.7 Profiling of Primal and Adjoint CFD Solvers 


3.7.1 Compilation of Discrete Adjoint OpenFOAM 


The compilation of (discrete adjoint) OpenFOAM requires a small subset of C++11 and C++14 
features (mostly for template resolution). Therefore, a somewhat recent compiler is required to 
compile the discrete adjoint OpenFOAM framework. The compilation has been tested with gcc, 
clang, and the Intel icc compiler and different MPI implementations (OpenMPI, Intel MPI). 





In Table 3.8 the compilation times for the OpenFOAM core package and a subset of solvers for 
adjoint mode (a1s), tangent mode (t1s) and passive mode are listed. For the release binaries, 
optimization flags (-O3) are set and debug symbols are disabled. 

Between passive mode and a1s mode, the compile time increases by roughly 30%, regardless of 
the compiler used. For adjoint and passive mode, clang compiles the fastest; for tangent mode 
the compile times of all compilers are roughly on par. 








For all configurations (including passive mode), a template instantiation depth of at least 41 is 
required. In the absence of a documented compiler statistic reporting the instantiation depth, this 
number was obtained by gradually increasing the allowed template depth via the -ftemplate-depth 
compiler flag until no errors occurred during compilation. This finding underlines the complexity 
of the code and its reliance on templating, which makes modes of AD other than operator overloading 
very challenging. 


3.7.2 Identifying Hotspots 


In order to understand the impact of AD on the run time and memory behavior of a given 
problem, it is important to accurately quantify where most of the run time and memory is spent 
during program execution. For run time profiling, several approaches were tried, namely 
profiling with gprof, the callgrind tool of valgrind, and gperftools. gprof and gperftools 
sample the program state at fixed (high frequency) intervals and extrapolate from these 
samples how much time is spent in which subroutines, whereas callgrind executes the program on a 
simulated CPU and counts the instructions executed per function. The function names and other 
program internals are obtained from the debug symbols inside the executable. 





Table 3.8: Compilation times of discrete adjoint OpenFOAM (optimized build with -O3) on 
4 cores with 8 threads for passive, adjoint (A1S) and tangent (T1S) mode with different 
supported compilers. 

Compiler     A1S (min)   T1S (min)   Passive (min) 
g++ 4.9          42.05       33.60           31.05 
g++ 5.4          40.55       32.00           29.60 
g++ 6.3          40.50       32.40           30.20 
clang 3.8        37.50       32.55           26.20 
To assess the memory consumption related to the adjoint mode, we use function level 
instrumentation. Function instrumentation makes it possible to inject additional code, which is 
executed at the entry and exit of each function (also for inlined functions). This can be used to 
precisely track the tape size, as well as the current execution time, at the entry and exit of each 
function call. 

Three different approaches to inject the additional function calls into the code were investigated: 


Using constructors and destructors. An auxiliary object is constructed at the entry of each 
function. The constructor and destructor (which is called once the function has returned 
and the object goes out of scope) of the object can be used to call the custom functions, 
which perform the run time and memory tracking. The auxiliary object is inserted into 
the code by performing a code transformation with the clang-rewriter tool¹. This tool 
identifies function declarations in the AST, injects statements into the 
AST, and transforms the result back to code. One advantage of this method is that the name and 
signature of the executed functions can be passed to the profiler directly as a string. The 
disadvantage is that the code rewriting process is complex and time consuming. It also tends to 
miss certain functions, e.g. functions which are generated by C preprocessor macros. 





Using the -finstrument-functions feature of gcc and clang. With the -finstrument-functions 
compile flag, the compiler inserts calls to the hook functions __cyg_profile_func_enter and 
__cyg_profile_func_exit at the entry and exit of each function. These hooks can be defined by 
the user to perform the desired actions. The advantage is that the function hooks are directly 
injected by the compiler, and thus no function calls are missed. One disadvantage is that the function 
name and signature of the instrumented functions are not readily available, but have to be 
reconstructed from the debug symbols in the executable. The lookup from debug symbols 
is rather expensive, though the overhead for repeated lookups can be kept low with a hash 
map implementation. 





Using the -fxray-instrument feature of clang-5.0. This feature, recently added to LLVM clang, 
instruments functions much like -finstrument-functions, but gives greater 
control over the granularity of instrumentation through file and function lists. Furthermore, 
it respects an instrumentation threshold that makes it possible to exclude functions with very high 
call counts but minimal impact (e.g. operator[]). 
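
The first approach in the list above, an auxiliary object whose constructor and destructor bracket a function body, can be sketched as follows. This is an illustration only: TapeCounter is a hypothetical stand-in for the AD tape (only its current size matters here), and ScopeRecord, ScopeProbe, and assemble are names introduced for this sketch.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the AD tape: only the number of entries is tracked.
struct TapeCounter {
    std::size_t entries = 0;
};

struct ScopeRecord {
    std::string name;
    std::size_t tapeEntries;  // entries recorded between entry and exit
    double seconds;           // wall clock time between entry and exit
};

// Constructed at function entry; the destructor runs when the function returns.
class ScopeProbe {
public:
    ScopeProbe(std::string name, const TapeCounter& tape, std::vector<ScopeRecord>& log)
        : name_(std::move(name)), tape_(tape), log_(log),
          startEntries_(tape.entries), start_(std::chrono::steady_clock::now()) {}

    ~ScopeProbe() {
        const std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - start_;
        log_.push_back({name_, tape_.entries - startEntries_, dt.count()});
    }

    ScopeProbe(const ScopeProbe&) = delete;
    ScopeProbe& operator=(const ScopeProbe&) = delete;

private:
    std::string name_;
    const TapeCounter& tape_;
    std::vector<ScopeRecord>& log_;
    std::size_t startEntries_;
    std::chrono::steady_clock::time_point start_;
};

// Example of an instrumented function: the probe is the first statement.
void assemble(TapeCounter& tape, std::vector<ScopeRecord>& log) {
    ScopeProbe probe("assemble", tape, log);
    tape.entries += 5;  // pretend that five assignments were recorded
}
```

Because the function name is passed as a string literal, no lookup in the debug symbols is needed, which mirrors the advantage named for this approach.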


Here we focus on the second approach, as it produced the most reliable results. The third option 
would be preferred, due to the better granularity control, but the implementation in LLVM seems 
incomplete at this point in time and documentation is lacking. 
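
A minimal sketch of the second approach is given below. It defines the two hook functions that gcc and clang call when a translation unit is compiled with -finstrument-functions. The counters in the namespace probe and the helper balanced are assumptions of this sketch; a real profiler would record the tape size and a time stamp in the hooks instead.

```cpp
#include <cassert>
#include <cstddef>

namespace probe {
// Counters updated by the hooks; plain globals keep the hooks cheap.
std::size_t enterCount = 0;
std::size_t exitCount = 0;
std::size_t depth = 0;
std::size_t maxDepth = 0;

// True if every recorded function entry has a matching exit.
__attribute__((no_instrument_function))
bool balanced() { return enterCount == exitCount; }
}  // namespace probe

extern "C" {

// Called at the entry of every instrumented function. thisFn is the address of
// the entered function, callSite the address it was called from.
// The attribute prevents the hooks themselves from being instrumented.
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void* thisFn, void* callSite) {
    (void)thisFn;
    (void)callSite;
    ++probe::enterCount;
    if (++probe::depth > probe::maxDepth) probe::maxDepth = probe::depth;
}

// Called at the exit of every instrumented function.
__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void* thisFn, void* callSite) {
    (void)thisFn;
    (void)callSite;
    ++probe::exitCount;
    if (probe::depth > 0) --probe::depth;
}

}  // extern "C"
```

The hooks only receive addresses, which is why the function names have to be reconstructed from the debug symbols afterwards. Without the compile flag the hooks are never called automatically, so the file can be compiled and tested in isolation by calling them directly.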

Instrumenting all functions produces a lot of data, which at a very fine level of granularity is of 
little use and considerably slows down program execution, as the instrumented functions are not 
inlined by the compiler. Therefore, we deliberately disabled the instrumentation for regions of the 
code that mostly concern the handling, calculation, and storage of data on a very low level (e.g. 
operator[] on vectors). The influence of these omitted functions is not lost, but accumulated 
into functions higher up in the call tree. Instrumentation was omitted for the src/OSspecific/ 
and src/OpenFOAM/ folders in the source tree. Instrumentation of functions outside the scope 
of OpenFOAM was disabled by blacklisting code in the system folders /usr, /lib and /etc. 

In instrumentation mode, dco/c++ creates a trace of the program execution, which lists all 
encountered functions in chronological order, as well as the connectivity of the functions. From 
this information, the full call graph of the program can be reconstructed. For every (instrumented) 
function entered and exited, the following information is stored: 


¹ https://clang.llvm.org/doxygen/classclang_1_1Rewriter.html 
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Number of children: Number of functions called by the function. 

Memory size: Number of tape entries created by the function excluding children. 
Cumulative memory size: Number of tape entries created by the function including children. 
Input count: Number of adjoint inputs. 

Output count: Number of adjoint outputs. 

Function name: Name of the function, stripped of function and template arguments. 
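
The relation between the memory size and the cumulative memory size of a trace entry can be sketched with a small call tree. TraceRecord and accumulateTape are hypothetical names for this illustration; the actual trace format of the instrumentation is not reproduced here, and the input and output counts are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One entry of the trace (illustrative subset of the stored fields).
struct TraceRecord {
    std::string name;                   // function name without arguments
    std::size_t memorySize;             // tape entries created by the function itself
    std::size_t cumulativeMemory;       // filled in by accumulateTape()
    std::vector<std::size_t> children;  // indices of the directly called functions
};

// Post-order walk over the call tree: the cumulative size of a function is its
// own size plus the cumulative sizes of all of its children.
std::size_t accumulateTape(std::vector<TraceRecord>& trace, std::size_t node) {
    assert(node < trace.size());
    std::size_t total = trace[node].memorySize;
    for (std::size_t child : trace[node].children) {
        total += accumulateTape(trace, child);
    }
    trace[node].cumulativeMemory = total;
    return total;
}
```

The number of children of an entry is simply children.size(). Functions excluded from instrumentation do not appear as nodes, so their tape entries end up in the memory size of the nearest instrumented caller.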


To obtain a run time profile of adjointSimpleFoam, we run the instrumentation on the laminar 
problem of the angled duct introduced in Section 3.1.2. Due to the high run time overhead of the 
instrumentation mode, only one iteration of the adjointSimpleFoam solver is traced. 

In Figure 3.27 the memory sizes for all (direct) child functions of the main function are shown. 
The blue bars show the tape memory used for a single step of the adjointSimpleFoam solver 
in black-box mode. The red bars show the same functions, but with symbolically differentiated 
linear solvers. Consequently, the red bars are considerably lower than the blue ones for the 
linear solver calls (observe the logarithmic scale of the x axis) and identical for the rest. 
The main complexity (at least in terms of memory consumption) thus shifts away from the linear 
solvers to different places in the code. 

In Figure 3.28 we show the same information for the Pitz-Daily example, with k-ε turbulence 
model. We see the same general behavior as with the laminar case, but with an additional 
memory spike for the correction of the turbulence model. The kEpsilon::correct() routine 
embeds the two linear solvers, required to solve the turbulence equations. Therefore, the memory 
consumption differs between black-box and symbolically differentiated linear solvers. 

The full information of the instrumentation trace is best explored interactively, which makes it 
possible to limit the high information density to regions of interest. A screenshot of a tool 
developed to visualize the call history is shown in Figure 3.26. The figure shows a breakdown of the 
function calls, including sub calls of up to level eight, for one iteration on the Pitz-Daily case. The 
hierarchical function calls are arranged in a sunburst diagram. The area of each slice corresponds 
to run time or tape memory size. 







Figure 3.26: Screenshot of interactive instrumentation visualization tool. 
[Figure 3.27, bar chart with logarithmic tape size axis. Recoverable bar labels: Sum, fvm::Sp, fvMatrix::relax, solve, fvMatrix::A, divide, fvMatrix::H, fvc::flux, fvc::interpolate, multiply, fvc::div, fvm::laplacian, fvMatrix::solve, fvMatrix::flux, fvc::grad, CostFunction::eval.] 

Figure 3.27: Tape size consumed by function calls in adjointSimpleFoam for the angled duct 
case. Black-box differentiation in blue and symbolic linear solvers in red. 
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Figure 3.28: Tape size consumed by function calls in adjointSimpleFoam for the turbulent 
Pitz-Daily case. Black-box differentiation in blue and symbolic linear solvers in red. 
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3.7.3 Application of Profiling Results 


Inspecting the detailed breakdown of function cost, one can begin to identify routines which 
can be further optimized by exploiting application insight. As a case study, we 
inspect the function lduMatrix::H(), which ranks highly in the list of functions obtained by the 
instrumentation. This function calculates the negative product of the off-diagonal entries L, U of 
a sparse matrix A = L + D + U in LDU format with a vector $\psi$: 


$$H\psi = -(L + U)\psi = (D - A)\psi.$$ 


The product is implemented component-wise in a loop over all faces in 
src/OpenFOAM/matrices/lduMatrix/lduMatrix/lduMatrixUpdateMatrixTemplates.C, as shown in 
Listing 3.19, where Hpsi holds the result of the product, lower and upper hold the matrix 
coefficients l and u of the lower and upper triangular matrices L and U, and lAddr and uAddr 
store the row and column indices L and U of the matrix coefficients. 

One use case for the H() operator is the assembly of the pressure correction equation from the 
calculated velocities (e.g. in simpleFoam). 

In this application, lower and upper are scalar fields and psi is a vector field (i.e. the 
individual entries psi[lAddr[face]] are in $\mathbb{R}^3$). The adjoints can be calculated symbolically 
by the code in Listing 3.20, where a1_lower, a1_upper, a1_psi and a1_Hpsi are the adjoint vectors 
corresponding to lower, upper, psi and Hpsi. This function is a good candidate for introducing 
symbolic adjoints, due to its limited scope and its easily derived derivative. The adjoint code was 
implemented by applying AD to the computational kernel from Listing 3.19 by hand, according 
to the rules introduced in Section 2.7.2. For more complex codes, a source code transformation 
tool could be used. 

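The scalar vector multiplication in Lines 3-4 of Listing 3.19 induces scalar products (calculated 
by the OpenFOAM ampersand operator) in the adjoint code (Listing 3.20) in Lines 4 and 8. This 
is illustrated by the following example with scalar $\lambda$ and vectors $x, y \in \mathbb{R}^n$: 


$$y = \lambda x$$ 

$$\bar{\lambda} = \left(\frac{\partial y}{\partial \lambda}\right)^T \cdot \bar{y} = \sum_{i=0}^{n-1} x_i \bar{y}_i$$ 

$$\bar{x} = \left(\frac{\partial y}{\partial x}\right)^T \cdot \bar{y} = \lambda \bar{y}.$$ 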


Listing 3.20 shows the hand-adjoined loop over the faces, calculating the adjoints of the matrix 
entries l and u and the vector $\psi$. Due to the split of primal and adjoint calculation, additional 


// Input: lower, upper, psi 
for (label face=0; face<nFaces; face++) 
{ 
    Hpsi[uAddr[face]] -= lower[face]*psi[lAddr[face]]; 
    Hpsi[lAddr[face]] -= upper[face]*psi[uAddr[face]]; 
} 
// Output: Hpsi 


Listing 3.19: Implementation of the H() operator. 
for (Foam::label face=0; face<nFaces; face++){ 
    // adjoin Hpsi[uAddr[face]] -= lower[face]*psi[lAddr[face]]; 
    a1_lower[face] -= psi[lowerAddr[face]] & a1_Hpsi[upperAddr[face]]; 
    a1_psi[lowerAddr[face]] -= lower[face] * a1_Hpsi[upperAddr[face]]; 

    // adjoin Hpsi[lAddr[face]] -= upper[face]*psi[uAddr[face]]; 
    a1_upper[face] -= psi[upperAddr[face]] & a1_Hpsi[lowerAddr[face]]; 
    a1_psi[upperAddr[face]] -= upper[face] * a1_Hpsi[lowerAddr[face]]; 
} 


Listing 3.20: Loop over faces to calculate adjoints of H(). 


boilerplate code is needed to correctly insert the symbolic adjoint calculation into the tape. This 
code, in slightly simplified form, is given in Listings 3.21 and 3.22, showing the creation of the 
gap in the tape and the symbolic calculation of adjoints during the interpretation. The process 
using adjoint callback functions is very similar to the implementation of SDLS. 
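
The hand-written adjoint of the H() kernel can be checked in isolation, without tape or callback machinery. The following sketch is a simplification made for this illustration: psi is a scalar field rather than a vector field, so the ampersand scalar product of Listing 3.20 reduces to an ordinary multiplication. The names Vec, Addr, applyH, HAdjoints, and adjointH are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Addr = std::vector<std::size_t>;

// Primal kernel: Hpsi = -(L + U) psi in LDU addressing (scalar psi for brevity).
Vec applyH(const Vec& lower, const Vec& upper, const Vec& psi,
           const Addr& lAddr, const Addr& uAddr) {
    assert(lower.size() == upper.size() && lAddr.size() == uAddr.size());
    Vec Hpsi(psi.size(), 0.0);
    for (std::size_t f = 0; f < lower.size(); ++f) {
        Hpsi[uAddr[f]] -= lower[f] * psi[lAddr[f]];
        Hpsi[lAddr[f]] -= upper[f] * psi[uAddr[f]];
    }
    return Hpsi;
}

struct HAdjoints {
    Vec lower, upper, psi;
};

// Hand-written adjoint: one pair of statements per primal assignment.
HAdjoints adjointH(const Vec& lower, const Vec& upper, const Vec& psi,
                   const Addr& lAddr, const Addr& uAddr, const Vec& a1_Hpsi) {
    assert(a1_Hpsi.size() == psi.size());
    HAdjoints a{Vec(lower.size(), 0.0), Vec(upper.size(), 0.0), Vec(psi.size(), 0.0)};
    for (std::size_t f = 0; f < lower.size(); ++f) {
        // adjoin Hpsi[uAddr[f]] -= lower[f] * psi[lAddr[f]]
        a.lower[f] -= psi[lAddr[f]] * a1_Hpsi[uAddr[f]];
        a.psi[lAddr[f]] -= lower[f] * a1_Hpsi[uAddr[f]];
        // adjoin Hpsi[lAddr[f]] -= upper[f] * psi[uAddr[f]]
        a.upper[f] -= psi[uAddr[f]] * a1_Hpsi[lAddr[f]];
        a.psi[uAddr[f]] -= upper[f] * a1_Hpsi[lAddr[f]];
    }
    return a;
}
```

For a single face between two cells, the primal result has only two entries, so the adjoints can be verified by hand against the partial derivatives of those two entries.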


The code calculates the product Hpsi, and thus this vector is registered as an output of the 
function H(). Its adjoints become an input to the calculation of the symbolic adjoint. 





Due to the product rule, the primal values of lower, upper, and psi are needed in the reverse 
section. Copies of those vectors are saved in a checkpoint. To access elements of the adjoint 
vectors at the correct positions, the LDU addressing needs to be available in the symbolic adjoint 
function as well. The addressing of the matrix indices is assumed to be constant throughout 
the program execution. Thus, only a pointer to it is stored, instead of a full copy. If the mesh 
topology, and thus the connectivity of the faces, changes during program execution, full copies 
have to be saved. 





The vectors lower, upper, and psi are the inputs to H() and are registered as inputs with the 
adjoint helper object D. Their adjoints have to be incremented with the calculated symbolic adjoints 
at the end of the adjoint callback function. 


The sizes of the checkpoints and of the input and output vectors all correspond to the number of cells 
in the mesh. In contrast, the number of tape entries created by the loop over the faces, which 
calculates the matrix vector product, depends on the number of faces in the mesh 
(two assignments are recorded for each face). The symbolic treatment avoids creating those tape 
entries. The improvement achieved by the symbolic implementation is therefore expected to be 
higher the denser the matrix becomes, as the density is determined by the ratio between faces 
and cells. The savings should thus be best for complex polyhedral 3D meshes, where each cell is 
connected to multiple other cells. 








In Table 3.9 we show the memory consumption and run time for the 2D reference cases from 
Section 3.1.2, as well as for the OpenFOAM motorbike case, which has a more complex 3D mesh. 
The savings observed in practice with the symbolic implementation of H() are minor (< 5%), 
due to the rather large amount of data which needs to be checkpointed. However, no cases with 
decreased performance were observed. Thus, the optimization is still worthwhile and serves as a 
template for further optimizations. 
Foam::Field<Foam::vector> Foam::lduMatrix::H(const Field<Foam::vector>& psi) 
{ 
    Field<vector> Hpsi(lduAddr().size(), Zero); 
    if (lowerPtr_ || upperPtr_){ // only needed if matrix not diagonal 
        const label nFaces = upper().size(); 

        ADmode::global_tape->switch_to_passive(); 
        D = ADmode::global_tape->create_callback_object<ADmode::external_adjoint_object_t>(); 
        // register adjoint inputs 
        forAll(lower(), i) 
            D->register_input(lowerPtr[i]); 

        forAll(upper(), i) 
            D->register_input(upperPtr[i]); 

        forAll(psi, i){ 
            D->register_input(psiPtr[i][0]); 
            D->register_input(psiPtr[i][1]); 
            D->register_input(psiPtr[i][2]); 
        } 

        // checkpoint primal values and pointers to the LDU addressing 
        D->write_data(lower()); 
        D->write_data(upper()); 
        D->write_data(psi); 
        D->write_data(&(lduAddr().lowerAddr())); 
        D->write_data(&(lduAddr().upperAddr())); 

        for (label face=0; face<nFaces; face++){ 
            Hpsi[uAddr[face]] -= lower[face]*psiPtr[lAddr[face]]; 
            Hpsi[lAddr[face]] -= upper[face]*psiPtr[uAddr[face]]; 
        } 

        forAll(Hpsi, i){ 
            Hpsi[i][0] = D->register_output(dco::passive_value(Hpsi[i][0])); 
            Hpsi[i][1] = D->register_output(dco::passive_value(Hpsi[i][1])); 
            Hpsi[i][2] = D->register_output(dco::passive_value(Hpsi[i][2])); 
        } 
        ADmode::global_tape->insert_callback<ADmode::external_adjoint_object_t>(gapH, D); 
        ADmode::global_tape->switch_to_active(); 
    } 
    return Hpsi; 
} 


Listing 3.21: Primal code for H(), augmented to enable symbolic differentiation in the reverse 
interpretation. 
inline void gapH(typename Foam::ADmode::external_adjoint_object_t *D){ 
    // restore the checkpointed primal values and the LDU addressing 
    const Foam::scalarField& lower = D->read_data<Foam::scalarField>(); 
    const Foam::scalarField& upper = D->read_data<Foam::scalarField>(); 
    const Foam::vectorField& psi = D->read_data<Foam::vectorField>(); 

    const Foam::labelUList& lowerAddr = *(D->read_data<Foam::labelUList*>()); 
    const Foam::labelUList& upperAddr = *(D->read_data<Foam::labelUList*>()); 

    Foam::scalarField a1_lower(lower.size(), Foam::Zero); 
    Foam::scalarField a1_upper(upper.size(), Foam::Zero); 
    Foam::vectorField a1_psi(psi.size(), Foam::Zero); 
    Foam::vectorField a1_Hpsi(psi.size(), Foam::Zero); 

    // fetch the incoming adjoints of the output Hpsi 
    forAll(a1_Hpsi, i) 
        for (int j=0; j<3; j++) 
            a1_Hpsi[i][j] = D->get_output_adjoint(); 

    Foam::label nFaces = upper.size(); 

    for (Foam::label face=0; face<nFaces; face++){ 
        a1_lower[face] -= psi[lowerAddr[face]] & a1_Hpsi[upperAddr[face]]; 
        a1_psi[lowerAddr[face]] -= lower[face] * a1_Hpsi[upperAddr[face]]; 

        a1_upper[face] -= psi[upperAddr[face]] & a1_Hpsi[lowerAddr[face]]; 
        a1_psi[upperAddr[face]] -= upper[face] * a1_Hpsi[lowerAddr[face]]; 
    } 

    // increment the adjoints of the inputs in the order of registration 
    forAll(lower, i) 
        D->increment_input_adjoint(dco::passive_value(a1_lower[i])); 

    forAll(upper, i) 
        D->increment_input_adjoint(dco::passive_value(a1_upper[i])); 

    forAll(psi, i) 
        for (int j=0; j<3; j++) 
            D->increment_input_adjoint(dco::passive_value(a1_psi[i][j])); 
} 


Listing 3.22: Symbolic adjoint code for H(). 





Table 3.9: Comparison between regular H() and symbolic adjoint implementation. 


Test case     Variant        Adjoint vector size   Total tape size (MB)   Run time (s)
Angled duct   regular H()    177467477             6093.35                25.75
              symbolic H()   171887480             5966.35                25.68
Pitz-Daily    regular H()    121031438             3951.65                10.56
              symbolic H()   118963913             3897.06                10.50
Pi ee         regular H()    709475801             25049.55               91.53
              symbolic H()   686295043             24420.04               91.50
3.8 Overcoming AD Memory Limits


3.8.1 File Tape 


The AD tool dco/c++ supports multiple options to store the tape in memory (RAM or disk): 


Blob Tape: Use an adjoint stack of fixed length. Memory is reserved when the tape is created. 
Most performant option, as no bounds need to be checked. 





Chunk Tape: Use an adjoint stack which consists of multiple chunks, that are allocated on 
demand. Most flexible option, slightly less performant. 


File Tape: Use chunks, but buffer full chunks to files on the file system. File IO is handled by 
the kernel and chunks may be fully buffered in memory. Significantly slower than both 
blob and chunk tape, but allows running big calculations when checkpointing and offloading 
with AMPI is not feasible. 


The file tape can be utilized when memory demands are high and checkpointing is hard to
implement. One such case arises when, for an iterative simulation, a single time step does not fit
into the tape, which would require a more granular checkpoint level. To get reasonable performance
from the file tape, fast storage is needed. The access patterns are sequential writes during the
recording phase, and sequential reads from the adjoint stack combined with random-access reads
and writes on the adjoint vector during the interpretation phase.

To predict the performance of the file tape, multiple tests were performed on different architectures.
We studied the performance of an SSD RAID (6 × 1024 GB in RAID 0) and an Intel NVME
installation on the RWTH compute cluster. For RAID 0, the capacity of a volume is the sum of
the capacities of the involved disks. Latency and throughput are improved by striping, i.e. data
of consecutive writes is distributed over multiple disks, reducing the IO bandwidth needed on the
individual disks. For random access, striping also improves latency, as multiple read/write
operations can be performed at the same time.





Table 3.10: SSD benchmark reading/writing 10 GB of data in 10 MB blocks with GNU dd.
The average is taken over ten samples and rounded to the next multiple of 5 MB/s.


Mode    dd flag    SSD IO (MB/s)   NVME IO (MB/s)
write   no flags   950             1570
        direct     1765            1675
        dsync      830             705
        sync       710             740
        nocache    910             1250
read    no flags   390             1580
        direct     -               -
        dsync      390             199
        sync       340             1635
        nocache    900             1580
Figure 3.29: Read and write IO usage of the SSD/NVME devices for 40 iteration steps. Approx.
300 GB of data is written to disk during the augmented forward run, and read back during 
the adjoint propagation phase. 


The sustained read and write performance of the SSD and NVME systems, benchmarked with
the GNU dd command, is shown in Table 3.10. Different modes of disk access are specified by the
dd operands iflag and oflag. Data is read/written in consecutive chunks of block size 10 MB. For
small blocks, the read and write performance is significantly reduced; however, as dco/c++ reads
and writes the adjoint stack in chunks (corresponding to the chunks of a regular tape), the full read
and write bandwidth should be usable by the dco/c++ file tape implementation. For the SSD
system, the OS caches are flushed after each benchmark run. We do not have root access to the
NVME system, so its caches cannot be flushed manually. Instead, the RAM of the NVME system is
artificially filled as much as possible by another process, to avoid skewing the results through
system-level read/write buffers. The numbers obtained by these benchmarks correspond to the
maximum read/write rates observed with dco/c++, with the exception of the SSD write speed,
where the full bandwidth indicated by dd in direct mode could not be achieved.








To evaluate the file tape performance, we study a case run with adjointSimpleShapeFoam on
an S-bend geometry for 40 iteration steps. The case consumes 307 GB of tape memory, which
is written to disk in chunks of one GB (i.e. one file is created on the disk for every
chunk). The adjoint vector is kept fully in RAM, due to the non-consecutive memory access
pattern involved in the reverse interpretation of the tape. The high latency of disk-based storage
(compared to RAM) makes non-consecutive (random) memory access expensive. For this case,
the adjoint vector occupies 70 GB of RAM. For bigger cases, the size of the adjoint vector will
quickly become a bottleneck. A technique to overcome this bottleneck is presented in the
next section.











The IO bandwidth (as measured by iotop) during the augmented forward run and the adjoint reverse
propagation is shown in Figure 3.29. As SDLS (see Section 3.4) is enabled, new entries for the
adjoint stack are only generated outside of the linear solver iterations. During the augmented
forward run the NVME system consistently reaches a write speed of 1.8 GB/s. This is fast enough
that the pauses in the stream of write operations, caused by the symbolic differentiation of linear
solvers, become visible in the IO bandwidth. The SSD system settles at 650 MB/s after
an initial phase of higher bursts. We suspect these bursts to be a combination of OS buffering
and the SSD write caches, both of which become saturated after roughly 100 GB of data has been
written. For the SSD system, no idle periods are visible in the IO bandwidth, as the Linux kernel
buffers the write operations in memory. Apparently the write performance is not high enough
to completely drain those buffers during the idle periods. However, the write performance of
the SSD system is high enough not to majorly impact the forward run time, compared to the
NVME system.

The read performance of both the NVME and the SSD system is considerably lower than the write
performance, maxing out at 300 MB/s for the SSD based system and 550 MB/s for the NVME
system. In contrast to the write operations, the read operations are blocking. Therefore, if a
chunk of memory is not already contained in the OS cache, the calculation will stall until the
chunk is completely fetched from disk. The read performance can potentially be improved by
prefetching some data, reducing idle time in the IO subsystem. The lower read performance of
the SSD system clearly affects the run time of the adjoint propagation, which takes considerably
longer on the SSD system than on the NVME system.


3.8.2 Adjoint Vector Compression 


As already mentioned in the previous section, the adjoint vector, which in dco/c++ is stored in 
RAM to retain acceptable random access performance, limits the usefulness of offloading the 
adjoint stack to secondary storage. 

In order to calculate the adjoints for each entry of the adjoint vector, the entries of the adjoint
vector connected to it by outgoing edges must be accessible. For iterative methods, where many
changes in the states are only local to the current iteration and do not directly influence any
values in the next iterations, the required adjoint vector size can be bounded by the longest edge
in the tape. Once the tape interpretation has moved past a certain element in the adjoint vector
(that is, all partial derivatives of the element have been incremented along its outgoing edges),
its location in the adjoint vector can be reassigned to another element, which comes later in the
interpretation procedure and has not yet been incremented. The number of elements required
to be present in the compressed adjoint vector at the same time is bounded by the maximum
distance between two elements, connected by an edge, in the uncompressed adjoint vector. The
concept of the compressed adjoint vector was introduced in [NL18]. This thesis will focus on the
implementation in the context of complex iterative algorithms, the performance cost associated
with the adjoint vector compression, and a novel implementation to mostly recover the added cost.





We will first illustrate the adjoint vector compression with an example, before showing it more
rigorously. As a simple iterative algorithm we consider the Babylonian root finding method. This
fixed point iteration approximates the square root $\mathbb{R}^+ \to \mathbb{R}^+ : x = \sqrt{a}$
of a positive real number by the following iteration procedure:

\[ x_{i+1} = \frac{1}{2}\left(x_i + \frac{a}{x_i}\right). \]

This iteration procedure can be straightforwardly derived from Newton's method (see Section 2.9.2),
applied to find the root of $f(x) = x^2 - a$, as shown in Equation (3.6).

\[ x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)} = x_i - \frac{x_i^2 - a}{2 x_i} = \frac{1}{2}\left(x_i + \frac{a}{x_i}\right) \qquad (3.6) \]

It was developed independently of Newton's method several centuries earlier by Babylonian and
Greek mathematicians, hence the name.

From an initial guess $x_0 = a = 2$ we find the square root $x_3 \approx \sqrt{2}$ with an error of 0.00015%
after only 3 iterations. To introduce some intermediate variables, which are local to the iteration,
we decompose the fixed point iteration into the following equivalent SAC:





ADtype x = a; // initial guess
for (int i = 0; i < 3; i++){
    ADtype v1 = a/x;
    ADtype v2 = v1 + x;
    x = 0.5*v2;
}


Listing 3.23: SAC of Babylonian root finding algorithm for three iterations. 
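
The primal behaviour of this SAC can be checked without any AD tool. The following is a minimal sketch in plain C++, in which `double` stands in for `ADtype`; the function names `babylonian_sqrt` and `relative_error` are chosen here for illustration and are not part of the thesis code.

```cpp
#include <cassert>
#include <cmath>

// Babylonian iteration x_{i+1} = 0.5 * (x_i + a / x_i), started at x_0 = a.
// The loop body uses the same three single assignments as the SAC.
double babylonian_sqrt(double a, int iterations) {
  double x = a;  // initial guess
  for (int i = 0; i < iterations; i++) {
    double v1 = a / x;
    double v2 = v1 + x;
    x = 0.5 * v2;
  }
  return x;
}

// Relative error of the iterate with respect to std::sqrt.
double relative_error(double a, int iterations) {
  const double exact = std::sqrt(a);
  return std::fabs(babylonian_sqrt(a, iterations) - exact) / exact;
}
```

Calling `relative_error(2.0, 3)` gives a value on the order of 1e-6, which is consistent with the error of 0.00015% quoted for three iterations.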


Using the adjoint method of AD we calculate the approximation of the derivative

\[ \frac{\partial \sqrt{a}}{\partial a} = \frac{1}{2\sqrt{2}} \]

by adjoining the three iterations of the fixed point iteration up to an error of 0.00177%.

Figure 3.31 shows the tape structure generated by the three steps of the algorithm. The
parameter a is incremented from each iteration step; the adjoint of a must therefore be available
in the adjoint vector during the interpretation of all iterations. However, the individual iterations
do not depend on any other iteration steps, except for the state of the directly preceding iteration.
Outgoing edges of an entry of the adjoint vector only influence entries which are located before it
in the adjoint vector (that is, the partial derivatives were created earlier in the primal evaluation
phase). Entries below the current interpretation location are never read from or incremented
again. Those adjoints can be safely discarded, assuming they are not desired for some other
adjoint calculation, and the space in the adjoint vector may be reused for another entry. We will
now formalize this observation.
Theorem 6.

Let $l$ denote the maximum distance between two entries $e_i, e_j$, $i > j$, of the adjoint vector that are
connected by an edge $(e_i, e_j) \in E$. Then the number of elements that need to be retained in the
adjoint vector is sharply bounded by $l + 1$.


Proof. By induction we will show that a rolling window of length $l + 1$ over the adjoint vector
is enough to perform all increments. Let $S = [s_0, \dots, s_{n-1}] \in \mathbb{R}^n$ be the full adjoint vector
and $S^i = [s_{\max(0,\,n-i-l-1)}, \dots, s_{n-i-1}]$ be a sub vector including at most $l + 1$ consecutive elements
of $S$, starting at index $\max(0, n-i-l-1)$. This can be pictured as a rolling window moving
backwards over the full adjoint vector.


Base case: incrementation of adjoints connected to adjoint entry $s_{n-i-1}$ for $i = 0$:
All adjoint vector entries $s_j$ with $(s_{n-1}, s_j) \in E$ have to be incremented. As the distance
between $s_{n-1}$ and $s_j$ is at most $l$, the adjoint vector $S^0 = [s_{n-l-1}, \dots, s_{n-1}]$ contains all
required entries $s_j$. Note how the element from which all edges relevant for this step
originate is the last entry of the rolling window.


Inductive step: incrementation of adjoints connected to adjoint entry $s_{n-i-1}$ for $i = k + 1$:
The last element of the previous inductive step is not needed any more, because all entries
$s_j$ with $(s_{n-k-1}, s_j) \in E$ have been incremented and there cannot be any backward
pointing edges $(s_i, s_j) \in E$ with $j > i$. We can thus transform $S^k$ into $S^{k+1}$ by removing
the last element and inserting one more element at the front, namely $s_{n-(k+1)-l-1}$. By the
definition of $l$ the vector $S^{k+1}$ again includes all elements $s_j$ with $(s_{n-1-(k+1)}, s_j) \in E$,
required to increment all entries connected to the last entry of the adjoint vector $S^{k+1}$.


The induction terminates when all entries of the adjoint vector have been fully incremented
(for $i = n - 1$). $\square$





The naive adjoint vector compression breaks down if there are edges in the tape that span multiple
iteration steps. This is commonly the case if each state depends on a set of parameters that is
defined at the beginning of the program. For illustration, see the dashed arrows in Figure 3.30.
To overcome this, the set of parameters is made available at each point of the tape interpretation.
Assuming that the first $p$ entries of the adjoint vector $S$ are parameters, the size of the compressed
adjoint vector can again be bounded, $S^i \in \mathbb{R}^{l_s+1+p}$, with the following new definitions for $S^i$
and $l_s$:

\[ l_s = \max_{(s_i, s_j) \in E,\; j \geq p} (i - j), \]

\[ S^i = [s_0, \dots, s_{p-1}, s_{\max(p,\,n-i-l_s-1)}, \dots, s_{n-i-1}]. \]

We call the set of parameters fixed at the start of the adjoint vector perpetuated parameters [NL18].
Then $l_s$ is the length of the longest edge in the tape that is not connected to a perpetuated parameter.
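
To make the definition of the longest edge concrete, the following sketch computes it for an explicit edge list. It assumes a hypothetical tape layout for the Babylonian example (entry 0 is the parameter a, and iteration k creates v1, v2 and x at positions 3k+1, 3k+2 and 3k+3); the types and function names are illustrative and not taken from dco/c++.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// An edge (i, j) with i > j: interpreting entry i increments the adjoint of entry j.
using Edge = std::pair<std::size_t, std::size_t>;

// Longest edge over all edges whose target j is not one of the
// first p (perpetuated) entries of the adjoint vector.
std::size_t longest_edge(const std::vector<Edge>& edges, std::size_t p) {
  std::size_t l = 0;
  for (const Edge& e : edges) {
    if (e.second < p) continue;  // edge into a perpetuated parameter
    const std::size_t d = e.first - e.second;
    if (d > l) l = d;
  }
  return l;
}

// Size of the compressed adjoint vector: p parameters plus a window of l + 1.
std::size_t compressed_size(const std::vector<Edge>& edges, std::size_t p) {
  return p + longest_edge(edges, p) + 1;
}

// Edges of the assumed Babylonian tape layout for a given number of iterations.
std::vector<Edge> babylonian_edges(std::size_t iterations) {
  const std::size_t a = 0;  // the parameter a is entry 0
  std::vector<Edge> edges;
  for (std::size_t k = 0; k < iterations; k++) {
    const std::size_t x_prev = 3 * k;  // for k = 0 this is a itself (x = a)
    const std::size_t v1 = x_prev + 1, v2 = x_prev + 2, x = x_prev + 3;
    edges.emplace_back(v1, a);       // v1 = a / x depends on a ...
    edges.emplace_back(v1, x_prev);  // ... and on the previous x
    edges.emplace_back(v2, v1);      // v2 = v1 + x
    edges.emplace_back(v2, x_prev);
    edges.emplace_back(x, v2);       // x = 0.5 * v2
  }
  return edges;
}
```

Under this assumed layout, `compressed_size(babylonian_edges(n), 1)` evaluates to 4 for any number of iterations n ≥ 2, whereas without perpetuation (`p = 0`) the longest edge grows linearly with n, because every iteration references a.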

In a practical implementation, the transformation of vector $S^i$ to $S^{i+1}$ would either require copying
data, or a more complex linked data structure that allows removing an element at one end and
inserting a new element at the other end at constant cost. To avoid this added complexity, the
addressing into the vector is instead implemented relying on the cyclic properties of the modulo operation.

This allows reusing the same allocated memory for the sequence of vectors $S^i$ without the need
to copy any values. The cyclic shifting of the adjoint vector is illustrated in Table 3.11.
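
The modulo addressing can be sketched with a deliberately small tape. The code below is not the dco/c++ implementation; it assumes a hypothetical `Node` record holding at most two arguments, the Babylonian iteration as the recorded program, and one perpetuated parameter stored in slot 0. Non-parameter entries are mapped to a cyclic window via `p + (i - p) % window`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One tape entry: up to two arguments with their local partial derivatives.
struct Node {
  std::size_t arg[2];
  double partial[2];
  int nargs;
};

// Record the Babylonian iteration; entry 0 is the parameter a.
std::vector<Node> record_babylonian(double a, int iterations) {
  std::vector<Node> tape;
  tape.push_back(Node{{0, 0}, {0.0, 0.0}, 0});  // parameter a, no arguments
  std::size_t x_idx = 0;                        // x = a aliases entry 0
  double x = a;
  for (int i = 0; i < iterations; i++) {
    const double v1 = a / x;
    tape.push_back(Node{{0, x_idx}, {1.0 / x, -a / (x * x)}, 2});
    const std::size_t v1_idx = tape.size() - 1;
    const double v2 = v1 + x;
    tape.push_back(Node{{v1_idx, x_idx}, {1.0, 1.0}, 2});
    const std::size_t v2_idx = tape.size() - 1;
    x = 0.5 * v2;
    tape.push_back(Node{{v2_idx, 0}, {0.5, 0.0}, 1});
    x_idx = tape.size() - 1;
  }
  return tape;
}

// Reverse interpretation with p perpetuated entries followed by a cyclic window.
// Choosing window == tape.size() reproduces the uncompressed adjoint vector.
double interpret(const std::vector<Node>& tape, std::size_t p, std::size_t window) {
  std::vector<double> adj(p + window, 0.0);
  auto slot = [p, window](std::size_t i) {
    return i < p ? i : p + (i - p) % window;  // modulo addressing
  };
  adj[slot(tape.size() - 1)] = 1.0;  // seed the output
  for (std::size_t i = tape.size(); i-- > p;) {
    const double a1 = adj[slot(i)];
    adj[slot(i)] = 0.0;  // the slot is reused by the entry `window` positions earlier
    for (int k = 0; k < tape[i].nargs; k++)
      adj[slot(tape[i].arg[k])] += tape[i].partial[k] * a1;
  }
  return adj[0];  // adjoint of the parameter a
}
```

For three recorded iterations, `interpret(tape, 1, 3)` uses four adjoint slots and returns the same value as `interpret(tape, 1, tape.size())`. Resetting a slot immediately after it has been read is the design choice that makes the reuse safe: an earlier entry mapped to the same slot is only incremented after the later one has been fully processed.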
















Table 3.11: Adjoint vector of length eight, with longest edge three. Rolling window compression 
(left) and modulo shifting (right). Current position in interpretation underlined. 
Figure 3.30: Uncompressed tape representation of the Babylonian root example.

Figure 3.31: Adjoint vector compressed to the longest edge between states, excluding edges from
states to parameters. Dashed lines indicate references to the virtual uncompressed adjoint 
vector. 





The interpretation phase, using the adjoint vector compression technique, for the Babylonian
example is illustrated in Figure 3.31. There is only one perpetuated parameter $a$. The longest
edge not connected to parameter $a$ spans two nodes. The required size for the compressed
adjoint vector is thus $p + l_s + 1 = 1 + 2 + 1 = 4$.

To further illustrate the principle of this approach, an implementation of a basic AD tool, 
applied to the Babylonian root problem, is included in Appendix D. The tool implements both the 
tape interpretation for a regular tape and a modulo compressed tape. For the exact transformation 
of the full adjoint vector to the individual entries of the modulo compressed adjoint vector, refer 
to this code. The implementation in dco/c++ is functionally identical, however not as transparent 
due to the added complexity of using a template engine. 

The main benefit of the adjoint vector compression is that the part of the tape that needs to be
stored in RAM can be made independent of the number of iterations. This allows differentiating
through a high number of iterations by offloading the growing part of the tape to HDD/SSD
storage while keeping the latency-sensitive adjoint vector completely in RAM.


Implementing Adjoint Vector Compression in OpenFOAM 


The only change required in the discrete adjoint OpenFOAM framework, in order to activate the
compression of the adjoint vector, is to replace the adjoint data type dco::ga1s<double>::type
with the dco::ga1s_mod<double>::type data type. The compression of the adjoint vector
is not enabled by default, as it adds a run time penalty to the tape interpretation phase.
The modulo operations introduce additional work and prohibit the CPU from utilizing
speculative execution.
Figure 3.32: Run time of 16 iterations of the Pitz-Daily case using the uncompressed (a1s) and
the modulo compressed (a1s_mod) adjoint vector implementation.


As a heuristic to determine the set of parameters which need to be perpetuated, dco/c++ 
treats all variables that are explicitly registered with the tape as parameters. After changing the 
adjoint data type, the remaining code of adjointSimpleFoam can be left unchanged, as the first 
action after creating the tape is already to register all parameters a, making them perpetuated 
by default. 

Before we examine the memory savings achieved by the adjoint vector compression in detail, we
will first look at the run time. The run times for 16 iterations of the adjointSimpleFoam solver on the
Pitz-Daily case, with and without adjoint vector compression, are shown in Figure 3.32. While
the run time of the augmented primal evaluation stays nearly identical, the time required for
the interpretation of the tape grows considerably. This illustrates the additional computational
complexity introduced by the modulo operation, performed with each increment of adjoint
variables. The time required for the allocation of the adjoint vector decreases, already hinting at
a reduction in RAM usage.

The extent of the run time penalty, introduced by the modulo operator, and consequently an 
optimization to reduce the overhead of the reverse adjoint propagation, will be presented in the 
following section. 

















3.8.3 Bitwise Modulo Optimization 





The introduction of the compressed adjoint vector increases the run time of the reverse adjoint
propagation phase. Discrete adjoint OpenFOAM seems to be particularly susceptible to this
increase in run time; other synthetic benchmarks in the dco/c++ benchmark suite exhibit a less
significant increase. The root cause is the execution of the
modulo operator in every adjoint incrementation. The data flow reversal, at least in parts of the
calculation, is not memory bound, such that this additional numerical operation has a significant
impact on the total run time. A naive implementation of the modulo operator might look like
the following [Knu97]:

\[ a \bmod b = a - b \left\lfloor \frac{a}{b} \right\rfloor \]


Table 3.12: Calculation of (13 mod 8) = 5 using the bitwise operation optimization for divisors
that are powers of two.

             b_3   b_2   b_1   b_0
a             1     1     0     1
b             1     0     0     0
b-1           0     1     1     1
a & (b-1)     0     1     0     1


The actual implementation by the compiler might be optimized, but will still generally require an 
integer or floating point division. 

The run time of the adjoint propagation phase can be reduced by expanding the adjoint vector
size to the next power of two, which allows replacing the modulo operation by a more efficient bitwise
operation. On all modern platforms bitwise operators are executed very efficiently, especially
compared to the division operation involved in the calculation of the modulo
remainder. In the worst case, the adjoint vector grows by a factor of two when expanding the
size to the next power of two. Due to the small adjoint vector size, compared to the stack memory
size, this should be negligible in most cases. We will now show that the equivalence between the
modulo operation and the bitwise AND operation holds for divisors which are powers of two.
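
The expansion to the next power of two can be sketched as follows. The helper name `next_pow2` is an assumption made for this illustration and does not refer to a dco/c++ function.

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of two that is greater than or equal to n (for n >= 1).
// The result is always smaller than 2 * n, which gives the worst-case
// growth factor of two for the adjoint vector.
std::size_t next_pow2(std::size_t n) {
  std::size_t p = 1;
  while (p < n) p <<= 1;
  return p;
}
```

A vector of 365 elements would be padded to 512 elements, while a vector whose length already is a power of two is left unchanged.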





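Theorem 7.
Let $i \in \mathbb{N}$, $b = 2^i$ and $a \in \mathbb{N}^+$. Then the following equivalence holds:

\[ a \bmod b = a \mathbin{\&} (b - 1), \]

where $\&$ is the bitwise AND operator.


Proof. The bitmask $b - 1$ masks all bits lower than the single significant digit corresponding to $b$.
The bitwise AND operation then strips away all digits of $a$ left of and including the digit in $b$,
removing all integer multiples of $b$ from $a$. With $a_j$ denoting the $j$-th binary digit of $a$, this yields
the desired remainder $a - \lfloor a/b \rfloor\, b$:

\[ b - 1 = \sum_{j=0}^{i-1} 2^j, \]

\[ a \mathbin{\&} (b - 1) = a - \sum_{j=i}^{\lfloor \log_2(a) \rfloor} a_j 2^j = a - \left\lfloor \frac{a}{b} \right\rfloor b. \qquad \square \]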


The bitwise AND calculation, replacing a modulo operation, is illustrated in Table 3.12 using the
example $(13 \bmod 2^3) = 5$.
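
The equivalence can also be checked numerically. The following sketch uses unsigned 32-bit integers and illustrative function names; it additionally shows that the mask must not be used for divisors which are not powers of two.

```cpp
#include <cassert>
#include <cstdint>

// Remainder via division; valid for any b > 0.
std::uint32_t mod_div(std::uint32_t a, std::uint32_t b) { return a % b; }

// Remainder via bit mask; valid only if b is a power of two.
std::uint32_t mod_mask(std::uint32_t a, std::uint32_t b) { return a & (b - 1); }

// True if b is a power of two, i.e. exactly one bit is set.
bool is_pow2(std::uint32_t b) { return b != 0 && (b & (b - 1)) == 0; }

// Checks the equivalence for all a below `limit` and all b = 2^i with i < 16.
bool equivalence_holds(std::uint32_t limit) {
  for (std::uint32_t i = 0; i < 16; i++) {
    const std::uint32_t b = std::uint32_t{1} << i;
    for (std::uint32_t a = 0; a < limit; a++)
      if (mod_div(a, b) != mod_mask(a, b)) return false;
  }
  return true;
}
```

For the example of Table 3.12, `mod_mask(13, 8)` and `mod_div(13, 8)` both return 5. For `b = 12` the two functions disagree, which is why the adjoint vector length has to be rounded up to a power of two before the mask is applied.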

To quantify the impact of the operations, we will first analyze the impact of the modulo and 
bitwise AND operator on a synthetic benchmark and its disassembly. The loops in Listing 3.24 
// modulo
for (int i = 0; i < 1024; i++) {
    a = a % b;
}

// bitwise AND
for (int i = 0; i < 1024; i++) {
    a = a & (b-1);
}


Listing 3.24: Synthetic benchmark for modulo and bitwise AND operation. 


are benchmarked using the Google Benchmark framework* with random positive integers for a
and b = 2°. 

As can be seen in the disassembly in Listing 3.25, on systems with an x86 or IA64 based processor
the compiler schedules a call to the idivl [Int16] instruction, which computes the quotient as
well as the remainder of two signed integer numbers. The computation can be slightly sped up
(approx. 5%) by switching a and b to unsigned data types. Measuring the execution time of individual
instructions on modern architectures is not straightforward, due to instruction fusing and
out-of-order execution. The following benchmarks should thus only be interpreted qualitatively.

The for-loops around the operations are necessary to capture a measurable time delta for
the bitwise AND operation at all. The benchmark framework runs the loops several hundred
thousand times to get a reliable average of the run time. The benchmark tool reports an average
run time of 6328 ns for the first loop and 100 ns for the second loop, with a standard
deviation below one percent. The run time of the second loop can be further reduced to 89 ns
by eliminating the subtraction of one (as the adjoint vector size is known when entering the data
flow reversal, the subtraction can be performed once instead of at each calculation of the modulus).
In the absence of any memory access (a and b are held in registers) the bitwise AND operation is faster
than the modulo operation by a factor of more than 60. Assuming the full clock rate of 3.3 GHz of
the machine can be utilized for the benchmark, this nets an average execution time of 0.25 CPU
cycles per bitwise AND instruction and 20 cycles per regular modulo instruction, indicating that
the compiler is able to efficiently vectorize the instructions.

Switching from the synthetic benchmarks to the application in CFD codes, we expect the savings
from the bitwise AND operation to be much smaller, as not all operands are stored in processor
registers. Therefore, memory latency is introduced, which will affect both implementations equally.

In Figure 3.33 the earlier picture showing 16 iterations of the adjointSimpleFoam solver is
updated with the newly obtained results from the bitwise AND operator. The modulo operations
added a significant run time overhead to the regular adjoint solver. Almost the whole overhead can
be recovered by switching to the bitwise AND operation instead. As seen previously, the allocation
of the adjoint vector decreases in run time significantly for both modulo implementations, due to
the much smaller adjoint vector.





*https://github.com/google/benchmark 
void foo(int a, int b) {
    int c = a % b;
    int d = a & (b-1);
}

void foo(int a, int b) {
   0:  55          push   %rbp
   1:  48 89 e5    mov    %rsp,%rbp
   4:  89 7d ec    mov    %edi,-0x14(%rbp)
   7:  89 75 e8    mov    %esi,-0x18(%rbp)
    int c = a % b;
   a:  8b 45 ec    mov    -0x14(%rbp),%eax
   d:  99          cltd
   e:  f7 7d e8    idivl  -0x18(%rbp)        // signed integer division
  11:  89 55 f8    mov    %edx,-0x8(%rbp)
    int d = a & (b-1);
  14:  8b 45 e8    mov    -0x18(%rbp),%eax
  17:  83 e8 01    sub    $0x1,%eax          // subtract 1
  1a:  23 45 ec    and    -0x14(%rbp),%eax   // bitwise AND
  1d:  89 45 fc    mov    %eax,-0x4(%rbp)
}


Listing 3.25: Disassembly of Lines 1-4, as generated by g++ -c -O0 -g. Optimization will
remove some of the move instructions, however the arithmetic operations remain unchanged.
Figure 3.33: Run time of 16 iterations of the Pitz-Daily case using the modulo (a1s_mod) and
bitwise AND (a1s_bitw) adjoint vector compression.
Figures 3.34-3.37 show the same case with different numbers of iterations between
one and 64. The maximum of 64 iterations is chosen such that the tape fills the 128 GB of
RAM of the test machine almost completely when not utilizing SDLS. For 64 iterations, the tape
consumes around 100 GB of RAM and the uncompressed adjoint vector around 19 GB. Utilizing the
modulo operation, the size of the adjoint vector compresses down to 365 MB. When increasing
the adjoint vector length to the next power of two, the size of the adjoint vector grows to 512 MB.
The figures clearly show that the compressed adjoint vector size stays constant for increasing
iteration counts. The run time of the interpretation increases slightly when utilizing the bitwise
AND optimization, but grows at the same rate as for the uncompressed version. All effects of the
adjoint vector compression, both in terms of run time and compression, are less pronounced for
the SDLS enabled runs, as the ratio of instructions treated by dco/c++ compared to the overall
operations is lower in those cases.

In summary, the modified version of the adjoint vector compression allows regaining most
of the performance that was lost when the regular modulo adjoint vector compression was
introduced. The adjoint vector compression makes large calculations feasible which would
previously have been limited by the memory size of the adjoint vector.
Figure 3.34: RAM usage without SDLS for uncompressed and compressed adjoint vector using
the modulo and bitwise AND operators.

Figure 3.35: RAM usage with SDLS for uncompressed and compressed adjoint vector using the
modulo and bitwise AND operators.

Figure 3.36: Run time without SDLS for uncompressed and compressed adjoint vector using
the modulo and bitwise AND operators.

Figure 3.37: Run time with SDLS for uncompressed and compressed adjoint vector using the
modulo and bitwise AND operators.
4 Implementing Efficient Algorithms for Steady Flows


After introducing AD to the calculation of full iteration histories, we will now discuss approaches 
that will improve efficiency of the computations for steady state flow cases. 


4.1 Reverse Accumulation 


4.1.1 Introduction 


Reverse Accumulation allows iteratively accumulating the adjoints of a problem which contractively
converges to a steady solution (fixed point) by adjoining the last iteration step repeatedly.
The algorithm is extensively documented in the literature [Chr94; Chr92; Gil92]; here we focus
on the application of the algorithm to typical CFD optimization cases. To this end, we assume
the parameters to be the topology optimization parameters $\alpha \in \mathbb{R}^{n_\alpha}$.

For the purpose of describing the reverse accumulation algorithm, we again separate our
problem into a pre-processing step $x^0 = P(\alpha)$, an (iterative) processing step

\[ x^k = F(x^0, \alpha) = f^k(x^{k-1}, \alpha) \circ f^{k-1}(x^{k-2}, \alpha) \circ \dots \circ f^1(P(\alpha), \alpha), \]

and a post-processing step $y = J(x^k, \alpha) = J(F(P(\alpha), \alpha), \alpha)$. The independent (state) variables $x$
are initialized in the pre-processing step, potentially depending on the parameters $\alpha$. For laminar
steady flow, the independents are the velocity, pressure, and face flux vectors $x = (U, p, \phi)$. From
the output of the pre-processing step, the state is iteratively converged to the fixed point $x^*$,
bringing the residual of the Navier-Stokes equations towards zero. The iteration function $f$ can,
potentially in every iteration step, switch between different execution branches, e.g. due to
upwinding in the discretization schemes; therefore a different function $f^i$ is assumed for every
iteration step. After the converged state is reached, the cost function $J$ is calculated from the
last state $x^k$, and potentially also from $\alpha$.

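As already stated previously in Section 3.2.2, applying AD to the whole iteration process of $k$
iterations w.r.t. the parameters $\alpha$ yields

\[ \frac{dx^k}{d\alpha} = \frac{\partial f^k}{\partial x^{k-1}} \frac{dx^{k-1}}{d\alpha} + \frac{\partial f^k}{\partial \alpha}, \qquad (4.1) \]

with the recursion formula

\[ \frac{dx^i}{d\alpha} = \frac{\partial f^i}{\partial x^{i-1}} \frac{dx^{i-1}}{d\alpha} + \frac{\partial f^i}{\partial \alpha}, \qquad \frac{dx^0}{d\alpha} = \frac{\partial P}{\partial \alpha}. \]

Reverse Accumulation uses the knowledge that, for a problem which has reached a converged
state $x^k = x^*$, the Jacobian

\[ \nabla f(x^k) = \frac{\partial x^k}{\partial x^{k-1}} \]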
Figure 4.1: DAG of the iterations used for reverse accumulation. $x^\dagger$ denotes the penultimate
state, from which on taping is enabled. The final state $x^*$ is calculated from $x^\dagger$ by the fixed
iteration step $f^*$. Compare to the black-box differentiation in Figure 3.5.


of the state $x^k = f(x^{k-1})$ differentiated w.r.t. the previous state $x^{k-1}$ has also converged to
a fixed state $\nabla f^*$. Furthermore, the location of the fixed point $x^*$ is independent of the starting
point $x^0$, and therefore

\[ \left\| \frac{dx^*}{dx^0} \right\| \to 0, \]

and consequently

\[ \left\| \frac{dy}{dx^0} \right\| \to 0. \]








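Let $x^*$ be the last iterate of a fixed point iteration, $x^\dagger$ the penultimate state, and $f^*$ the
last iteration producing the state $x^*$ from $x^\dagger$. Instead of evaluating the full chain (4.1), the
alternative chain

\[ \frac{dx^*}{d\alpha} \approx \left(\frac{\partial f^*}{\partial x^\dagger}\right)^{k-1} \frac{\partial f^*}{\partial \alpha} + \dots + \frac{\partial f^*}{\partial x^\dagger} \frac{\partial f^*}{\partial \alpha} + \frac{\partial f^*}{\partial \alpha} \qquad (4.2) \]

can be calculated. This is the earlier recursion formula explicitly unrolled for fixed Jacobians
$\partial f^*/\partial x^\dagger$ and $\partial f^*/\partial\alpha$ instead of $\partial f^i/\partial x^{i-1}$ and $\partial f^i/\partial\alpha$. The chain can be evaluated for an
arbitrary number of iterations $k$. Convergence predictions are given in [Chr94]. In our application
we choose $k$ high enough to ensure convergence of the adjoint fixed point iteration.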

The main advantage of the reverse accumulation approach is that only one iteration needs to be 
captured in the tape. Highlighting the passive and active sections of the procedure, the generation 
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of the output y from x° is illustrated in Figure 4.1. Passive computations are connected by 
dashed lines, computations which need to be captured inside the tape are drawn solid. 

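The alternative chain (4.2) can be conveniently evaluated by reverse mode AD. Interpreting the
tape for the iteration step $f^* : \mathbb{R}^{n_x + n_\alpha} \to \mathbb{R}^{n_x}$, which generates $x^*$ from $x^\dagger$, yields incremental
projections of the form

\[ \bar{x}^\dagger = \left(\frac{\partial f^*}{\partial x^\dagger}\right)^T s, \qquad \bar{\alpha} = \bar{\alpha} + \left(\frac{\partial f^*}{\partial \alpha}\right)^T s, \]

where $s \in \mathbb{R}^{n_x}$ is an arbitrary adjoint seed vector.

By choosing the first seed vector as $\bar{x}^0 = s^0 = (\partial J/\partial x^*)^T$ we construct the following iteration
which evaluates (4.2) by repeatedly using the result $\bar{x}^\dagger$ of the previous iteration as the seed vector
for the next iteration:

\[
\begin{aligned}
\bar{x}^1 &= \left(\frac{\partial f^*}{\partial x^\dagger}\right)^T \bar{x}^0 = \left(\frac{\partial J}{\partial x^*}\,\frac{\partial f^*}{\partial x^\dagger}\right)^T \\
\bar{x}^2 &= \left(\frac{\partial f^*}{\partial x^\dagger}\right)^T \bar{x}^1 = \left(\frac{\partial J}{\partial x^*}\left(\frac{\partial f^*}{\partial x^\dagger}\right)^2\right)^T \\
&\;\;\vdots \\
\bar{x}^k &= \left(\frac{\partial f^*}{\partial x^\dagger}\right)^T \bar{x}^{k-1} = \left(\frac{\partial J}{\partial x^*}\left(\frac{\partial f^*}{\partial x^\dagger}\right)^k\right)^T
\end{aligned}
\tag{4.3}
\]

While the adjoint $\bar{x}$ is reset to zero after each iteration, the adjoint $\bar{\alpha}$ is allowed to accumulate
and yields the desired adjoint approximation for $(\partial J/\partial x^*)(df^*/d\alpha)$:

\[
\begin{aligned}
\bar{\alpha}^0 &= \left(\frac{\partial J}{\partial \alpha}\right)^T \\
\bar{\alpha}^1 &= \bar{\alpha}^0 + \left(\frac{\partial f^*}{\partial \alpha}\right)^T \bar{x}^0 = \left(\frac{\partial J}{\partial \alpha} + \frac{\partial J}{\partial x^*}\,\frac{\partial f^*}{\partial \alpha}\right)^T \\
\bar{\alpha}^2 &= \bar{\alpha}^1 + \left(\frac{\partial f^*}{\partial \alpha}\right)^T \bar{x}^1 = \left(\frac{\partial J}{\partial \alpha} + \frac{\partial J}{\partial x^*}\left(I + \frac{\partial f^*}{\partial x^\dagger}\right)\frac{\partial f^*}{\partial \alpha}\right)^T \\
&\;\;\vdots \\
\bar{\alpha}^k &= \bar{\alpha}^{k-1} + \left(\frac{\partial f^*}{\partial \alpha}\right)^T \bar{x}^{k-1} = \left(\frac{\partial J}{\partial \alpha} + \frac{\partial J}{\partial x^*}\left(I + \frac{\partial f^*}{\partial x^\dagger} + \dots + \left(\frac{\partial f^*}{\partial x^\dagger}\right)^{k-1}\right)\frac{\partial f^*}{\partial \alpha}\right)^T
\end{aligned}
\tag{4.4}
\]

With $f^*$ being a contractive function, the norm $\|(\partial f^*/\partial x^\dagger)^k\|$ will tend to zero for $k \to \infty$,
making $\bar{x}^k$ an indicator for the convergence of $\bar{\alpha}$.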
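
The accumulation described by (4.3) and (4.4) can be illustrated on a scalar toy problem. The following is a minimal sketch, not the OpenFOAM implementation: the iteration step $f(x, \alpha) = \alpha \cos(x)$, the cost function $J(x) = x^2$, and all function names are assumptions chosen for illustration, and the partial derivatives are written by hand instead of being obtained from a tape.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical contractive fixed-point step x = f(x, alpha) = alpha * cos(x).
double f(double x, double alpha) { return alpha * std::cos(x); }
// Hand-written partial derivatives of f (an AD tool would provide these).
double df_dx(double x, double alpha) { return -alpha * std::sin(x); }
double df_dalpha(double x, double /*alpha*/) { return std::cos(x); }

// Hypothetical cost function J(x) = x^2 (no direct dependence on alpha).
double dJ_dx(double x) { return 2.0 * x; }

// Passive primal iteration, returns the (approximately) converged state.
double solve_primal(double alpha, int n_iter) {
    double x = 0.0;
    for (int i = 0; i < n_iter; ++i) x = f(x, alpha);
    return x;
}

// Reverse accumulation: repeatedly adjoin the single step x_star = f(x_dagger, alpha).
double reverse_accumulate(double x_dagger, double alpha, double eps, int max_iter) {
    assert(eps > 0.0 && max_iter > 0);
    const double x_star = f(x_dagger, alpha);
    const double fx = df_dx(x_dagger, alpha);      // fixed Jacobian df*/dx_dagger
    const double fa = df_dalpha(x_dagger, alpha);  // fixed Jacobian df*/dalpha
    double x_bar = dJ_dx(x_star);                  // seed, first line of (4.3)
    double alpha_bar = 0.0;                        // dJ/dalpha partial is zero here
    for (int k = 0; k < max_iter && std::fabs(x_bar) > eps; ++k) {
        alpha_bar += fa * x_bar;                   // accumulate as in (4.4)
        x_bar = fx * x_bar;                        // next adjoint state as in (4.3)
    }
    return alpha_bar;
}
```

For this toy problem the result can be compared against the closed form $dJ/d\alpha = 2 x^* \cos(x^*) / (1 + \alpha \sin(x^*))$, which follows from differentiating the fixed point condition $x^* = \alpha \cos(x^*)$.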

4.1.2 Implementation in a CFD Setting


In the following the procedure of accumulating the adjoints is detailed with focus on the usage of an
AD tool. The seeding and calculation of the adjoints of the post-processing step $J$ is left unchanged
from the standard black-box or checkpointed approaches, yielding the adjoints $\bar{x}^* = (\partial J/\partial x^*)^T \cdot 1$
and $\bar{\alpha} = (\partial J/\partial \alpha)^T \cdot 1$ of the cost function $J$ w.r.t. the final state $x^*$ and the parameters $\alpha$.

The tape of the post-processing step can be discarded once the adjoints $\bar{x}^*$ have been calculated.
The tape of the final iteration step is interpreted from $x^*$ back to $x^\dagger$, without discarding the
tape. This calculates the adjoints $\bar{x}^\dagger$, but also begins to accumulate the desired adjoints $\bar{\alpha}$ by
incrementing them with the adjoint chain $\bar{\alpha} = \bar{\alpha} + (\partial J/\partial x^*)(\partial f^*/\partial \alpha)$. Next, the adjoints $\bar{x}^\dagger$ are
extracted from the tape and written to a temporary storage field. This step is slightly complicated
by the fact that iterative solvers usually overwrite the existing state with the newly calculated
one, i.e. $x^{i+1} = f(x^i, \alpha)$ is actually implemented aliased as $x := f(x, \alpha)$. While the adjoints $\bar{x}^\dagger$
exist in the tape, they cannot be addressed by the variables $x$ (i.e. dco::derivative(x) in
dco/c++ notation), as this would instead yield $\bar{x}^*$ (memory aliasing). Therefore, the location
of $x^\dagger$ in the tape must be saved after first registering the state $x$ in the tape. This procedure
is similar to the extraction of adjoints of the state using checkpointing. For the OpenFOAM
implementation, we can therefore reuse parts of the adjoint checkpointing interface for the reverse
accumulation iteration.

After extracting $\bar{x}^\dagger$, the adjoints of all tape entries between $x^\dagger$ and $x^*$ in the adjoint vector
are reset to zero, leaving only the already partially accumulated adjoints $\bar{\alpha}$ as non-zeroes in the
adjoint vector.

Next, the stored adjoints $\bar{x}^\dagger$ are written back into the adjoint state $\bar{x}^*$. The
tape is then re-interpreted from $x^*$ to $x^\dagger$, yielding a new adjoint state $\bar{x}^\dagger$ and incrementing the
adjoints of the parameters $\alpha$.

Summarizing the above procedure, the following steps need to be implemented by a CFD solver
utilizing reverse accumulation in order to evaluate the adjoint chains (4.3) and (4.4):

• Calculate $n - 1$ iterations in passive mode up to $x^\dagger$ (choose $n$ such that the solution has
  sufficiently converged).
• Register all required parameters $\alpha$ in tape.
• Store current position of tape → $T^\dagger$.
• Register the state variables $x^\dagger$ in tape.
• Calculate the $n$th iteration in augmented forward mode, yielding $x^*$.
• Store the current position of the tape → $T^*$.
• Evaluate the cost function $J$ and seed adjoint $\bar{J} = 1$.
• To evaluate $dJ/d\alpha$ we proceed as follows:
  – Interpret tape from final position up to $T^*$.
  – Get adjoints of final state and store them: $\bar{x}_s = \bar{x}^*$.
    1. Set adjoints of the state $\bar{x}^* = \bar{x}_s$.
    2. Interpret tape from position $T^*$ to $T^\dagger$.
    3. Extract adjoints of initial state and store them: $\bar{x}_s = \bar{x}^\dagger$.
    4. Reset the values of all adjoints from $T^*$ to $T^\dagger$ to zero.
  – Repeat steps (1-4) until $\|\bar{x}_s\| < \epsilon$.
• $dJ/d\alpha$ is now available in $\bar{\alpha}$.
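
The steps above can be sketched with a deliberately small scalar tape. The `Tape` and `Active` types below are hypothetical stand-ins for an AD tool and do not reproduce the dco/c++ or OpenFOAM interfaces; the iteration step $f(x, \alpha) = \alpha \cos(x)$ and the cost $J = x^2$ are again assumptions made for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal tape: node i stores the local partials of variable i.
struct Tape {
    struct Node { std::size_t a, b; double wa, wb; };
    std::vector<Node> nodes;
    std::vector<double> adj;  // adjoint vector, one entry per tape variable

    std::size_t push(std::size_t a, double wa, std::size_t b, double wb) {
        nodes.push_back({a, b, wa, wb});
        adj.push_back(0.0);
        return nodes.size() - 1;
    }
    std::size_t register_variable() { return push(0, 0.0, 0, 0.0); }
    std::size_t position() const { return nodes.size(); }
    // Propagate adjoints backwards over the tape entries [to, from).
    void interpret(std::size_t from, std::size_t to) {
        for (std::size_t i = from; i-- > to;) {
            adj[nodes[i].a] += nodes[i].wa * adj[i];
            adj[nodes[i].b] += nodes[i].wb * adj[i];
        }
    }
    // Zero the adjoints of the tape entries [to, from).
    void reset(std::size_t from, std::size_t to) {
        for (std::size_t i = to; i < from; ++i) adj[i] = 0.0;
    }
};

struct Active { double v; std::size_t id; };

// Taped iteration step f(x, alpha) = alpha * cos(x) (toy assumption).
Active step(Tape& t, Active x, Active alpha) {
    return {alpha.v * std::cos(x.v),
            t.push(x.id, -alpha.v * std::sin(x.v), alpha.id, std::cos(x.v))};
}
// Taped cost function J(x) = x^2 (toy assumption).
Active cost(Tape& t, Active x) { return {x.v * x.v, t.push(x.id, 2.0 * x.v, x.id, 0.0)}; }

double reverse_accumulation_gradient(double alpha_value, int n, double eps) {
    assert(n >= 1 && eps > 0.0);
    double x = 0.0;                                        // n-1 passive iterations
    for (int i = 0; i < n - 1; ++i) x = alpha_value * std::cos(x);

    Tape tape;
    Active alpha{alpha_value, tape.register_variable()};   // register parameters
    const std::size_t T_dagger = tape.position();          // store position T_dagger
    Active x_dagger{x, tape.register_variable()};          // register state
    Active x_star = step(tape, x_dagger, alpha);           // n-th iteration, taped
    const std::size_t T_star = tape.position();            // store position T_star
    Active J = cost(tape, x_star);
    tape.adj[J.id] = 1.0;                                  // seed the cost function

    tape.interpret(tape.position(), T_star);               // adjoin post-processing
    double x_bar_s = tape.adj[x_star.id];
    for (int k = 0; k < 1000 && std::fabs(x_bar_s) > eps; ++k) {
        tape.adj[x_star.id] = x_bar_s;                     // step 1
        tape.interpret(T_star, T_dagger);                  // step 2
        x_bar_s = tape.adj[x_dagger.id];                   // step 3
        tape.reset(T_star, T_dagger);                      // step 4
    }
    return tape.adj[alpha.id];                             // accumulated dJ/dalpha
}
```

Because the parameter is registered before the position $T^\dagger$ is stored, the reset in step 4 leaves its accumulated adjoint untouched, which is the property the procedure relies on.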


Utilizing reverse accumulation, the calculation can be sped up considerably, as no checkpointing of 
an iteration evolution is required. The tape re-evaluation is quite efficient compared to augmented 
forward steps, further improving runtime. However, it can be difficult to find an iteration step 
which is a suitable candidate for reverse iteration, if the residuals are noisy. An example for 
the convergence of reverse accumulation compared to the black-box approach is given in the 
following section. 

The source code of an OpenFOAM solver, implementing reverse accumulation, using the 
aforementioned checkpointing interface, is given in Appendix C.1. 





4.2 Piggyback Adjoint Iteration 


The concept of piggyback adjoints in the context of AD was first introduced in [GF03] and further
detailed in [GW08]. In this approach, the adjoints are propagated alongside the primal values,
somewhat mirroring the behavior of the continuous adjoint, where the adjoint equations are 
solved alongside the primal equations. In a one-shot optimization setting [Bos+14], the design 
parameters are immediately optimized using the partially converged derivative information. 

The updates of primal, adjoint, and design state are therefore evaluated in one coupled iteration.
With the functions $f^k : \mathbb{R}^{n_x} \times \mathbb{R}^{n_\alpha} \to \mathbb{R}^{n_x}$, which calculate one iteration step of the primal, and
$g$, defined on $\mathbb{R}^{n_x} \times \mathbb{R}^{n_x} \times \mathbb{R}^{n_\alpha}$, which calculates the adjoints of the states, as well as the gradient required
for the update step, the iteration procedure can be outlined as follows:

    for $k = 0, \dots, n-1$:
        $x^{k+1} = f^k(x^k, \alpha^k)$
        $\bar{x}^{k+1} = g_x(x^k, \bar{x}^k, \alpha^k)$
        $\alpha^{k+1} = \alpha^k - P^{-1} g_\alpha(x^k, \bar{x}^k, \alpha^k)$,

where $P$ is a suitable preconditioner, to ensure the convergence and stability of the design optimization.

When applying AD, the function $g$ is not given explicitly, but is evaluated by adjoining the
iteration $f^k$, as well as the calculation of the objective $y = J(x^{k+1})$ after each iteration step.

The piggyback approach is closely related to reverse accumulation. It differs in that it does not
repeatedly adjoin a fixed (last) iteration step of a fixed point iteration. Instead, the most
recent iteration $f^k$ of the augmented primal is always used to obtain the next iterate for the adjoints $\bar{x}^{k+1}$.
New iteration steps are repeatedly calculated and adjoined until the changes in both $x$ and $\bar{x}$
fall under a prescribed threshold. Only after this threshold is reached is an update of the design
parameters $\alpha$ performed. The change in $\alpha$ will in turn increase the residuals of both $x$ and $\bar{x}$,
starting another round of inner piggyback iterations.

Even without applying an optimization, one advantage of the piggyback approach is that no
fixed point has to be identified in advance. The method lends itself well to one-shot optimization,
as the optimization of the parameters $\alpha$ can be started with a not completely converged gradient.
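
The coupled update can be tried out on a scalar example. The sketch below is an illustration under stated assumptions, not the solver described here: the primal step $f(x, \alpha) = 0.5x + \alpha$, the cost $J(x) = (x-1)^2$, the adjoint update $\bar{x} \leftarrow (\partial f/\partial x)\,\bar{x} + \partial J/\partial x$, and the scalar `p_inv` standing in for $P^{-1}$ are all hypothetical choices. The design is updated in every step (the fully coupled variant of the outline above).

```cpp
#include <cassert>
#include <cmath>

struct PiggybackState { double x, x_bar, alpha; };

// One coupled piggyback step for the toy problem
//   f(x, alpha) = 0.5 * x + alpha,   J(x) = (x - 1)^2.
PiggybackState piggyback_step(PiggybackState s, double p_inv) {
    assert(p_inv >= 0.0);
    const double df_dx = 0.5, df_dalpha = 1.0;
    s.x = 0.5 * s.x + s.alpha;                      // primal update
    s.x_bar = df_dx * s.x_bar + 2.0 * (s.x - 1.0);  // adjoint update
    const double gradient = df_dalpha * s.x_bar;    // current estimate of dJ/dalpha
    s.alpha -= p_inv * gradient;                    // design update (p_inv = 0: none)
    return s;
}

PiggybackState run_piggyback(PiggybackState s, double p_inv, int n) {
    for (int k = 0; k < n; ++k) s = piggyback_step(s, p_inv);
    return s;
}
```

With `p_inv = 0` the design stays fixed and `x_bar` converges to the gradient $dJ/d\alpha = 4(2\alpha - 1)$ of the converged problem; with a small positive `p_inv` the three quantities converge together towards $x = 1$, $\bar{x} = 0$, $\alpha = 0.5$.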

Figure 4.2: Sensitivities obtained by tangent and adjoint modes. The adjoint is piggybacked
from the initial state and reverse accumulated by repeatedly adjoining primal iteration step 
800. 


The piggyback method has been implemented in discrete adjoint OpenFOAM for the SIMPLE
algorithm, derived from the simpleFoam solver. As with reverse accumulation solvers, the
checkpointing interface can be used to help with the handling of the adjoints. Figure 4.2 shows a
comparison of the convergence of the sum of sensitivities for black-box tangents (which are identical
to the black-box adjoints), reverse accumulation starting after 800 iterations, and piggybacking
(without design updates, that is $P^{-1}$ is a zero matrix). The derivatives are calculated on the 2D
Pitz-Daily case, with SDLS enabled for piggyback and reverse accumulation solvers. The change in
the derivatives from iteration to iteration is detailed in Figure 4.3. Reverse accumulation converges
down to machine precision. The convergence of the black-box differentiation and piggyback
bottoms out after roughly 600 iterations; however, the derivatives have converged sufficiently that
no significant changes in the adjoints are observed. The obtained derivative residuals roughly match
the chosen linear solver tolerance; a stricter tolerance level will further improve the residuals.

Further applications of the piggyback approach, including optimization using a steepest descent
algorithm, are given in the case study presented in Section 5.1.

Figure 4.3: Change in the sum of sensitivities between subsequent iterations. Initial residual
scaled to one. 





4.3 Verification 


While the results of the sensitivity calculations obtained so far look plausible, they certainly still 
need to be verified. We will achieve this using the following steps: 


• Check the correct implementation of AD on the algorithms, by comparing results obtained
  by the adjoint method with results obtained by tangent mode and FD.

• Verify that the differentiated algorithms match the physics of the continuous adjoint, by
  comparing to results obtained with the continuous adjoint method.
  The adjointShapeOptimizationFoam solver, supplied with OpenFOAM, is used for calculating
  the continuous adjoints.


4.3.1 Mesh Independence of Adjoint Sensitivity 


For a steady state problem, the flow fields, obtained by a FVM discretization, should, for increasing 
mesh resolution, converge towards a final state, as truncation errors decay. As the fields are 
evaluated at different positions for each discretization, we evaluate the volume integrals of the 
fields instead to judge convergence. 

The angled duct case is evaluated for mesh refinement levels 5 (325 cells) to 80 (83 200 cells).
The primal calculation is performed with passive simpleFoam iterations, starting from a
pre-initialized potential solution. The primal iterations are run until both velocity and pressure fields
have reached a prescribed convergence tolerance. Afterwards one augmented forward primal
step is executed and then repeatedly adjoined using reverse accumulation. To speed up the
calculation, the mesh is decomposed into 8 processor domains. Figure 4.4 shows the number of
iterations required to achieve the prescribed solver tolerances. The number of primal evaluations
scales roughly linearly with the number of cells in the domain. The number of adjoint reverse
accumulations required rises more slowly, and after a certain point seems to grow logarithmically.

Figure 4.4: Number of primal SIMPLE iterations and number of adjoint reverse accumulation
iterations needed to bring case to convergence.
[Plot: iteration counts over mesh size; legend: Primal Iterations, Adjoint Iteration.]

The change in the velocity, pressure, and sensitivity fields, evaluated as volume integrals, is
detailed in Figure 4.5. Shown is the change in volume integral, compared to the previous mesh
resolution. The curves are normalized and shown in a double logarithmic scale. The change in
all fields decreases, as the FVM approximations converge towards the solution of the continuous 
problem. The pressure convergence lags behind the velocity, which is common with SIMPLE 
algorithms and is amplified by the lower relaxation factor chosen for the pressure correction 
equation. The formulation of the cost function is dominated by the influence of the pressure, 
therefore it is plausible that the convergence rate of the sensitivity seems strongly linked to the 
convergence rate of the primal pressure. 

Summarizing, the adjoint sensitivity field converges towards a fixed point as the mesh resolution
increases. It does so with a convergence speed comparable to the primal iteration. The number
of iterations required to achieve adjoint convergence is on par with, or lower than, that of the primal.

Figure 4.5: Change in velocity, pressure, and adjoint sensitivity field, compared to previous
mesh resolution. First value is normalized to one. 




4.3.2 Validation Against Tangent and FD Models 


To validate the implementation of AD, we first use the laminar angled duct test case. Units are
introduced to the dimensionless geometry by choosing $L = 1$ m, with the origin at the lower left
corner of the geometry.

In Figure 4.6 we show the values for the derivative $dJ/d\alpha$ at different discrete locations along
two evaluation lines crossing the domain. The first line $L_1$ connects the points (0, 0.55, 0.05)
and (5, 0.55, 0.05), the second line $L_2$ connects (4.55, 0, 0.05) to (4.55, 5, 0.05) (all coordinates in
meters). The coordinates of the endpoints are chosen such that the line passes the cell midpoints
of the cells it penetrates. The location of the lines $L_1$ and $L_2$, as well as the full sensitivity field,
are shown in Figure 4.7.

Figure 4.6 compares the results obtained by 400 iterations of adjointSimpleCheckpointingFoam
to the results obtained by a tangent and FD implementation. For the adjoint results, the symbolical
differentiation of the linear solvers is disabled, to ensure comparability to the tangents
up to machine precision. The tangent implementation uses the vector mode of dco/c++, allowing
all 15 reference points to be evaluated with one solver run. A tangent vector size of 16 was utilized,
using the 16th entry to calculate the sum of all sensitivities.

For FD, a one sided perturbation with h = 10~° was used. The tangent and FD evaluation 
points are offset by some margin, so that they do not overlap in the plot. The adjoint sensitivities 
are plotted without interpolating between the cells (to allow us to precisely match the tangent 
and FD data points), giving a piecewise constant sensitivity evolution along the lines $L_1$ and $L_2$.

The lines intersect the inflow and outflow boundary. At the inflow and outflow boundary, we 
observe a discontinuous behavior in the adjoint field. This is matched by the tangents and FD. 
It is thus not an error in the differentiation, but rather an issue of the primal implementation. 
The artifacts presumably arise from the implementation of the boundary conditions, which is 
continuous for the primal but not necessarily for the adjoint. The artifacts are limited to the 
cells adjacent to the inflow, outflow, and their neighboring cells. Further up- and downstream no 
obvious influence of those discontinuities can be observed. The wall boundary conditions do not 
exhibit such behavior. 

In Table 4.1 we summarize the adjoints for five exemplary points. The tangents match the
adjoints up to some orders of magnitude of the machine precision. Slight deviations are induced
due to the different order of application of arithmetic operations between adjoint and tangent mode,
and the use of the -ffast-math compiler flag, which can lead to additional truncation errors¹.





Table 4.1: Adjoints, tangents, and FD for five exemplary cells. Differences between adjoints and
tangents are marked bold.

Cell index   Adjoint             Tangent             FD           Error of FD
970          13351.8234523369    13351.8234523345    13351.6289   0.0015 %
1013         14929.8557699385    14929.8557699349    14929.9356   0.0005 %
2785         8605.97203473893    8605.97203473516    8605.8119    0.0019 %
2817         603.880863106184    603.880863106741    603.6694     0.0350 %
4099         -81.383599564168    -81.383599564306    -81.4823     0.1213 %


¹ https://gcc.gnu.org/wiki/FloatingPointMath

Figure 4.6: Sensitivities on two lines through the domain, computed with adjoint mode without
interpolation between cells. The results are verified with 15 points computed by T1V and FD
mode respectively.

Figure 4.7: Sensitivity field of angled duct case. Evaluation lines $L_1$ and $L_2$ are shown in white.

Figure 4.8: Sum of sensitivities computed with FD, tangent mode, and adjoint mode. FD with
h = 107° unreliable, all other curves match. 


Figure 4.8 shows the evolution of the sum of sensitivities

\[ S^i = \sum_j \frac{dJ(x^i)}{d\alpha_j} \]

over 150 iterations. This sum can be conveniently calculated by FD and tangent mode with a
run-time factor of $O(1) \cdot \mathrm{cost}(J(\alpha))$, by seeding all inputs at the same time:

\[ \frac{dJ}{d\alpha} \cdot \sum_j e_j = \sum_j \frac{dJ}{d\alpha} \cdot e_j = \sum_j \frac{dJ}{d\alpha_j}. \]

In adjoint mode, all sensitivities of a specific iteration step, and thus also the sum, are calculated
in $O(1) \cdot \mathrm{cost}(J(\alpha))$ as well. However, to obtain the sum after every iteration step, the tape has
to be completely evaluated from the current position back to the inputs after each iteration step.
This raises the run-time factor to the order of the number of iterations. The sensitivities after each
iteration are only needed for this verification task; in practice one would only evaluate the tape once,
after the primal iteration has finished.
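
The seeding trick for the sum of sensitivities can be reproduced with a few lines of tangent-mode code. The sketch below uses a hand-written dual number type as a stand-in for an AD tool's tangent mode; the cost function and all names are hypothetical and only serve to show that one run seeded with the all-ones vector yields the same value as the sum over all unit-vector seeds.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal first-order tangent (dual number) type, illustrative only.
struct Dual { double v; double d; };
Dual operator+(Dual a, Dual b) { return {a.v + b.v, a.d + b.d}; }
Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.d * b.v + a.v * b.d}; }
Dual sin_dual(Dual a) { return {std::sin(a.v), std::cos(a.v) * a.d}; }

// Hypothetical cost function J(alpha) = sum_i sin(alpha_i) * alpha_{(i+1) mod n}.
Dual cost(const std::vector<Dual>& alpha) {
    assert(!alpha.empty());
    Dual J{0.0, 0.0};
    for (std::size_t i = 0; i < alpha.size(); ++i)
        J = J + sin_dual(alpha[i]) * alpha[(i + 1) % alpha.size()];
    return J;
}

// Directional derivative of J along the given seed vector (one tangent run).
double tangent(const std::vector<double>& alpha, const std::vector<double>& seed) {
    assert(alpha.size() == seed.size());
    std::vector<Dual> a(alpha.size());
    for (std::size_t i = 0; i < alpha.size(); ++i) a[i] = Dual{alpha[i], seed[i]};
    return cost(a).d;
}

// Sum of sensitivities from a single run: all inputs seeded at the same time.
double sum_by_single_seed(const std::vector<double>& alpha) {
    return tangent(alpha, std::vector<double>(alpha.size(), 1.0));
}

// Reference: one run per input with unit-vector seeds e_j, summed afterwards.
double sum_by_unit_seeds(const std::vector<double>& alpha) {
    double sum = 0.0;
    for (std::size_t j = 0; j < alpha.size(); ++j) {
        std::vector<double> e_j(alpha.size(), 0.0);
        e_j[j] = 1.0;
        sum += tangent(alpha, e_j);
    }
    return sum;
}
```

The single-seed variant needs one evaluation of the cost function regardless of the number of inputs, whereas the reference needs one evaluation per input.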

The tangent and adjoint sums match up to machine precision for each iteration step. FD 
for h = 10~° also matches very closely. FD for h = 10~° exhibits significant noise, however the 
sensitivities obtained (or at least the sum) remain stable and oscillate around the correct value. 


4.3.3 Validation Against Continuous Adjoint Solver 


Next, we validate against the continuous adjoint solver included in OpenFOAM. This solver
is based on the principles described in [Oth08]; the implementation is further documented
in [OVW07]. It implements topology optimization for ducted flows and optimizes for power
loss using a steepest descent approach with fixed step width. For an introduction to the primal
and continuous adjoint equations, refer to Section 2.5. The cost function is hard coded into the
boundary conditions adjointOutletVelocity and adjointOutletPressure. The primal and
adjoint equations are iterated concurrently; the resulting sensitivities are immediately used to
update the design field $\alpha$ (this is the continuous equivalent to the discrete piggyback approach).
The solver calculates the adjoint velocity and pressure $\hat{U}$ and $\hat{p}$; the sensitivities of the individual
design parameters $\alpha_i$ are then calculated as the scalar product of primal and adjoint velocity
field:

\[ \frac{dJ}{d\alpha_i} = A_i \left( U_i \cdot \hat{U}_i \right). \]

Figure 4.9: Sensitivity fields obtained with discrete piggyback solver. Power loss integrated over
the inlet on the left, power loss integrated over inlet and outlet on the right. Contours of zero
sensitivity in white. While the field differs at the outflow, regions of negative sensitivity are
almost identical.


For the first part of the validation study, we set the step width of the steepest descent optimizer
to zero. Thus, the sensitivity field for a fixed (zero) field $\alpha$ is calculated.

The discrete solver is set up to match the cost function and boundary conditions of the stock
adjoint solver. This requires constraining the computation of the power loss to the inlet. If the
outlet is included, the sensitivity field changes substantially, while still indicating the same regions
of optimization. This effect is shown in Figure 4.9.

The sensitivities are again evaluated along the lines $L_1$ and $L_2$. Figure 4.10 shows the
sensitivities for the continuous and discrete solvers after convergence (2000 iterations). One can
see that the sensitivities line up very well. The irregularities introduced by the discrete solver at
the inflow and outflow boundaries do not introduce any effects in the inner flow domain which
would lead to obvious differences to the continuous solution.

Figure 4.11 shows a condensed y-axis of the previous plot, focusing on the most important
region of negative sensitivity. While we see some differences in magnitude between the continuous
and discrete adjoint solution here, the sign of the sensitivities is consistent between the continuous
and discrete calculations. A topology optimization approach would thus penalize the same cells,
albeit with a slightly different magnitude. For a graphical representation of these regions, we
show the sensitivity field for the continuous and discrete solver in Figure 4.12. White iso-lines
indicate the zero crossings of the sensitivity, bounding regions of negative sensitivity. The continuous
results are qualitatively identical to the results obtained with the discrete solver, albeit showing a
slightly larger zone of negative sensitivity near the upper right corner of the flow domain.

Figure 4.10: Sensitivities along $L_1$ and $L_2$ obtained using the continuous and discrete adjoint
approach.

Figure 4.11: Magnification of the negative section of Figure 4.10.

Figure 4.12: Sensitivity field obtained with the continuous (left) and discrete approach (right).
Iso-lines of zero sensitivity in white mark the regions of interest for topology optimization.

In Figures 4.13 and 4.14 we compare the convergence speed of the continuous primal and
adjoint equations to the convergence of the piggyback approach. As stated earlier, the piggyback
approach iterates the adjoint along the primals in a similar fashion to the implicitly coupled
continuous adjoint, making this comparison an obvious choice. The former figure shows the
convergence of the sum of sensitivities for a mesh of 11 700 cells, the latter for a finer mesh of
187 200 cells. While the continuous and discrete solver converge roughly along the same trajectory,
the discrete solver does so with significantly fewer oscillations, thus arriving earlier at a point where
the adjoints can be considered reliable. This potentially leads to a better approximation of the
gradient in a one-shot optimization setting. Lowering the under-relaxation factors of the primal
and adjoint continuous equations might lead to a reduction of the oscillations, at the expense of
slower convergence.

Figure 4.13: Convergence of the sums of sensitivities of piggybacking and continuous adjoint
for medium resolution case.

Figure 4.14: Convergence of the sums of sensitivities of piggybacking and continuous adjoint
for fine resolution case.

Next, we consider the Pitz-Daily case. For this example, we cannot expect a perfect match
with the continuous results, as the continuous solver available in the public domain uses a
frozen turbulence assumption, which assumes that the influence of the turbulence onto the
sensitivities is negligible and can therefore be omitted. The derivation and implementation of
continuous equations for turbulence models is complex and sometimes not even possible in a closed
form [Car+10; Nem+11]. The frozen turbulence assumption can be approximately replicated
with the discrete method, by pausing the tape recording during the calculation of turbulence.
The turbulent quantities are thus treated as if they were passive.
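
The effect of treating a quantity as passive can be shown on a two-line model. The sketch below works in tangent mode with a hand-written dual number type, whereas the solver described here pauses an adjoint tape; the "turbulence model" $\nu_t(u) = u^2$, the cost $J = \nu_t u$, and the helper `passive` are hypothetical and chosen only so that both derivatives are easy to check by hand.

```cpp
#include <cassert>
#include <cmath>

// Minimal first-order tangent (dual number) type, illustrative only.
struct Dual { double v; double d; };
Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.d * b.v + a.v * b.d}; }

// Drop the derivative information: the value is kept but treated as a constant.
// This plays the role of pausing the tape while the quantity is computed.
Dual passive(Dual a) { return {a.v, 0.0}; }

// Hypothetical "turbulence model": eddy viscosity as a function of velocity.
Dual eddy_viscosity(Dual u) { return u * u; }

// Hypothetical cost function J(u) = nu_t(u) * u.
Dual cost(Dual u, bool frozen_turbulence) {
    Dual nu_t = eddy_viscosity(u);
    if (frozen_turbulence) nu_t = passive(nu_t);
    return nu_t * u;
}

// Derivative dJ/du with fully differentiated or frozen turbulence.
double dJ_du(double u, bool frozen_turbulence) {
    assert(std::isfinite(u));
    return cost(Dual{u, 1.0}, frozen_turbulence).d;
}
```

For this model the fully differentiated derivative is $3u^2$, while the frozen variant returns $u^2$, so the two only agree where the contribution of the turbulent quantity is small.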

The primal flow and the turbulent kinetic energy within the Pitz-Daily geometry are shown in
Figure 4.15. The sensitivity results of the simulation after 8000 steps (from a zero initialized
solution) are shown in Figure 4.16.

The sensitivities computed with the continuous adjoint solver are shown on top. Below that are
the results obtained with the discrete method. The second image is computed by the piggyback
approach with differentiated k-ε turbulence model. The third image is computed with the same
solver, but with the taping of the turbulence model switched off. In this configuration the discrete
model thus replicates the frozen turbulence assumption of the continuous solver. The three images
look essentially alike; in particular, the region of negative sensitivity indicated by the white contour
lines is very similar. The biggest discernible difference is found at the step. At
this position a solution singularity in the sensitivity can be observed. The maximum sensitivity,
obtained at the singularity point by the continuous solver, is roughly 30% higher than the discrete
sensitivity obtained with both frozen and fully differentiated turbulence. Further downstream the
sensitivities match very well.





For further insight, we again plot the sensitivities over lines through the computational domain.
We consider five vertical lines, cutting through the domain at different distances from the inlet.
The first line $L_1$ is located 0.001 m right of the step, capturing the influence of the singularity
point. The second to fifth lines ($L_2$ to $L_5$) are located at x = {0.05 m, 0.1 m, 0.15 m, 0.2 m} behind
the step respectively. For the first line, both discrete solutions are very similar, indicating that
the effect of the turbulence on the sensitivities is low. However, the discrete adjoints do not
match the continuous solution too well. Nevertheless, the sign of the sensitivities is consistent
between the continuous and discrete results. Away from the singularity (which is located at 0.5
along the normalized line distance) the results match better. For lines $L_2$ to $L_5$, downstream of
the singularity, the frozen turbulence and fully differentiated discrete results start to differ
significantly. From Figure 4.15 we can see that this coincides with the regions of high turbulent kinetic
energy k. The result obtained by the discrete frozen turbulence assumption is in turn now very
similar to the continuous result, showing a maximum difference of 7% and an average difference
of below 2% along the line. Furthermore, the (interpolated) lines of the discrete solution match
all kinks and features of the continuous lines.

Figure 4.15: Velocity magnitude (absolute length of the velocity vector, top) and turbulent
kinetic energy (bottom) of the Pitz-Daily case.

Figure 4.16: Sensitivity fields obtained with the continuous (top), discrete (middle), and discrete
approach with frozen turbulence (bottom). Iso-lines of zero sensitivity in white mark the
regions of interest for potential topology optimization.

Figure 4.17: Sensitivities along (from top to bottom) lines $L_1$ to $L_5$. The continuous adjoints
match the discrete adjoints obtained with frozen turbulence.

Figure 4.18: Sensitivities along line $L_5$, obtained by the discrete adjoint model and matched by
FD. The continuous model with frozen turbulence (drawn dashed) clearly differs.


As we could not check the correctness of the fully differentiated turbulence models with the
continuous adjoint model, we again compare those values against FD and the tangent model. In
Figure 4.18 we verify, by way of example, the fully differentiated turbulence model with FD for 13 points
on $L_5$. The values obtained by FD match the fully discrete adjoint and clearly differ from the
frozen turbulence model.


4.3.4 Convergence of FD 


As derived in Section 2.7.1, FD approximates the derivative up to a truncation error of order $O(h)$
for one-sided differences and $O(h^2)$ for two-sided differences. Thus, the difference between FD
and the derivatives calculated by AD should shrink according to the relations

\[ \frac{J(\alpha + h e_i) - J(\alpha)}{h} - \frac{dJ(\alpha)}{d\alpha_i} = r_1 = O(h), \]

\[ \frac{J(\alpha + h e_i) - J(\alpha - h e_i)}{2h} - \frac{dJ(\alpha)}{d\alpha_i} = r_2 = O(h^2), \]

for some index $i$. We will now investigate if these relations hold for our implementation.

Figure 4.19 shows the difference between the FD approximation and the derivative calculated
by tangent mode for the cell at location $P = (0.02, -0.02, 0.0005)$ of the Pitz-Daily case. Plotted
are the absolute values of the differences $r_1$ and $r_2$ of FD and tangent, scaled by the tangent, for a
range of values $h$.

The step width of h is chosen in the range [10~°, 10°]. For relatively big values of h, the forward 
and central finite differences adhere strictly to the reduction factors O(h) and O(h?). For h in 
the range [10~7, 10"], the two sided differences, while still providing a better approximation than 
one sided, behave a bit erratically. We suspect that this is due to the different treatment of
the $\alpha$ term for positive and negative signs. While positive values of $\alpha$ contribute to the system
matrix and increase the diagonal dominance of the matrix, negative values are put onto the right
hand side of the momentum equations. For h < 10~°, the quality of the approximation begins to 
deteriorate. For even lower values, the FD values become noisy and eventually unusable. This
highlights the difficulty of finding a suitable step size $h$ for FD. For stiff problems,
the values of $h$ should be scaled by the central coefficient $a_P$ of the FVM discretization, that is
the corresponding diagonal entry of the discretization matrix.

Figure 4.19: Difference between FD and (tangent) AD for decreasing $h$.
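
The expected orders can be checked on any smooth function with a known derivative before applying the same test to a solver. The following sketch uses $J(\alpha) = e^\alpha$ as a hypothetical stand-in for the cost function; the function names are illustrative and not part of the implementation discussed here.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical smooth test function with known derivative.
double J(double a) { return std::exp(a); }
double dJ(double a) { return std::exp(a); }

// Absolute error of the one-sided (forward) difference, expected to be O(h).
double forward_error(double a, double h) {
    assert(h > 0.0);
    return std::fabs((J(a + h) - J(a)) / h - dJ(a));
}

// Absolute error of the two-sided (central) difference, expected to be O(h^2).
double central_error(double a, double h) {
    assert(h > 0.0);
    return std::fabs((J(a + h) - J(a - h)) / (2.0 * h) - dJ(a));
}
```

Reducing $h$ by a factor of 10 should reduce the forward error by roughly a factor of 10 and the central error by roughly a factor of 100, as long as $h$ is large enough for round-off to be negligible.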


4.4 Shape Adjoints 


Shape optimization, in contrast to topology optimization, aims to improve the geometry of the
domain directly, by manipulating the position of the nodes on the outer shell. This has the
advantage that the physical wall boundary conditions are retained, allowing the wall stresses
to be evaluated more accurately. Other boundary conditions like thermodynamic heat flows
can be implemented much more easily (or no change is required at all) if a physical wall is present.
Usually, the points on the surface are moved in the direction of the normals of the corresponding
faces. When the points on the outer surface move, the interior points have to be adapted as well,
such that the mesh quality does not degrade, which would lead to unreliable results or lack of convergence.
Two of the simplest approaches to mesh movement are to solve a system of spring/damper
equations or to apply a Laplacian smoothing. The downside of shape optimization is that the
mesh movement limits the range of optimization, as the design will retain some resemblance to
the original design, making it likely that a local minimum, instead of the global minimum, is
hit. It is also challenging to preserve design features during the mesh morphing process, as the
morphing tends to smooth out sharp features.

To facilitate the calculation of shape sensitivities in OpenFOAM, we register the positions
of the individual points in the mesh as soon as they are created. The mesh is set up in
meshes/polyMesh/polyMesh.C, and the points are created in the constructors

    Foam::polyMesh::polyMesh(const IOobject& io){[...]}

and

    Foam::polyMesh::polyMesh
    (
        const IOobject& io,
        const Xfer<pointField>& points,
        const Xfer<faceList>& faces,
        const Xfer<labelList>& owner,
        const Xfer<labelList>& neighbour,
        const bool syncPar
    ){[...]}

respectively.
After the mesh and boundaries have been created the points on the boundary can be registered 
in the tape, making them parameters for the following calculations. 


    forAll(boundary_, patchI){
        forAll(boundary_[patchI], faceJ){ // loop over all boundary faces
            const labelList ll = boundary_[patchI][faceJ];
            forAll(ll, j){ // register points of boundary faces
                ADmode::global_tape->register_variable( points_[ll[j]][0] );
                ADmode::global_tape->register_variable( points_[ll[j]][1] );
                ADmode::global_tape->register_variable( points_[ll[j]][2] );
            }
        }
    }


One could also register all points of the whole mesh. However, as we are only interested in moving
the points on the surface, and the inner points are adapted by a mesh smoothing technique,
we only register the points on the surface to save some tape memory. The set of parameters is
thus $\alpha = q \in \mathbb{R}^{3 \times n_P}$.

After the adjoint propagation is completed, the adjoints of the individual points of the mesh can 
be extracted. However, one is usually not interested in the raw adjoints but wants to constrain 
the movement of points to the normal direction of the associated faces. Those adjoints cannot be 
directly read from the tape, but can be calculated in a post-processing step from the adjoints of 
the individual points $q$. For each boundary face, the adjoints of the points contained in the patch 
are interpolated to the face interior by averaging the vectors, giving the face centered adjoint 
sensitivity vector $q_F$. Once the face center vector is calculated, it can be constrained to the face 
normal direction by taking the scalar product of $q_F$ with the face normal $n$. We define this scalar 
product as the shape sensitivity $G$, with one entry per boundary face: 

\[ G_i = \frac{q_{F,i} \cdot n_{F,i}}{A_{F,i}}, \]

where $i$ runs over the boundary faces and $A_F$ is the area of the boundary face, required to obtain 
a mesh independent sensitivity. For better compatibility with post processing tools, the resulting 
face defined quantity is copied to the face adjacent cell. The interpolation procedure is outlined in 
Listing 4.1. 
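For a single face, the three steps (averaging, projection onto the normal, division by the area) can be sketched without any OpenFOAM or dco/c++ types. The names `Vec3` and `faceSensitivity` are assumptions for this illustration; the point adjoints are passed in as plain numbers rather than read from a tape.

```cpp
#include <array>
#include <cassert>
#include <vector>

using Vec3 = std::array<double, 3>;

// Average the point adjoints of one face, project the mean onto the unit
// face normal, and divide by the face area.
// (Hypothetical stand-alone helper mirroring the steps in the text.)
double faceSensitivity(const std::vector<Vec3>& pointAdjoints,
                       const Vec3& unitNormal, double faceArea)
{
    assert(!pointAdjoints.empty() && faceArea > 0.0);
    Vec3 mean{0.0, 0.0, 0.0};
    for (const Vec3& q : pointAdjoints)
        for (int d = 0; d < 3; ++d) mean[d] += q[d];
    double dot = 0.0;
    for (int d = 0; d < 3; ++d)
        dot += (mean[d] / pointAdjoints.size()) * unitNormal[d];
    return dot / faceArea;
}
```

For example, four point adjoints whose mean is (2, 0, 0), a unit normal along x, and a face area of 0.5 give a sensitivity of 4: halving the face area doubles the value, which is the mesh-independence scaling mentioned above.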


4.4.1 Checkpointing of Shape Adjoints 


Conceptually, the application of checkpointing remains unchanged from the case of topology 
optimization. Compared to topology optimization, the pre-processor stage is much more complex. 

    // loop over all boundary patches
    forAll(mesh.boundary(),bi){
        // loop over all faces in boundary bi
        forAll(mesh.boundary()[bi],i){
            // list of point indices for face i on bi
            const labelList face_points = mesh.boundary()[bi].patch()[i];
            // cell in domain corresponding to boundary face
            const label face_cell = mesh.boundary()[bi].faceCells()[i];
            const Foam::vector face_normal = mesh.boundary()[bi].nf()()[i];
            Foam::vector sensVec(0,0,0);
            forAll(face_points,fp){
                const point& pt = mesh.points()[face_points[fp]];
                sensVec[0] += dco::derivative(pt[0]);
                sensVec[1] += dco::derivative(pt[1]);
                sensVec[2] += dco::derivative(pt[2]);
            }
            sensVec /= face_points.size();
            // scalar product of sensitivity vector with face normal
            sens[face_cell] = (sensVec & face_normal) / mesh.boundary()[bi].magSf()[i];
        }
    }


Listing 4.1: Interpolation of sensitivity vectors, defined at points, to the corresponding faces. 
The shape sensitivity is computed by computing the scalar product of the resulting averaged 
sensitivity vector with the face normal vector. 


In this stage the parameters, that is the location of the individual points of the mesh (contained 
in the primitive mesh), are used at various locations in the code to construct the CFD mesh 
representation. This mesh construction phase is only executed once and can not be restored from 
a checkpoint easily, therefore it is immediately included in the tape. Following the pre-processing 
phase, the tape is switched off and the usual checkpointed iteration phase begins. After all 
iteration steps have been adjoined, the remaining tape of the pre-processor is adjoined, yielding 
the adjoints of the parameters. 

A naive implementation yields results that are not consistent with black-box adjoints, 
indicating that some dependencies are missed. Those missing dependencies have been identified as 
the non-orthogonal correction vectors (compare Section 2.1.4) by manually comparing the tapes 
of black-box and checkpointed adjoint. The reason the dependencies are missed is the presence of 
on demand functions in OpenFOAM. Several data fields in the mesh object are stored in dynamic 
memory, and are only constructed once they are first requested by their access routine. 

The following access functions in the fvMesh class create their fields on demand: 

• C(): Constructs the cell center vector, 
• Cf(): Constructs the face center vector, 
• V(): Constructs the cell volume vector, 
• Sf(): Constructs the face area vectors, 
• magSf(): Constructs the magnitude of face area vectors, 
• deltaCoeffs(): Constructs delta coefficients, 
• nonOrthDeltaCoeffs(): Constructs the non orthogonal delta coefficients, 
• nonOrthCorrectionVectors(): Constructs the non orthogonal correction vectors. 

    void init_mesh(Foam::fvMesh& mesh) {
        mesh.Sf(); mesh.magSf();
        mesh.C(); mesh.Cf();
        mesh.V(); mesh.deltaCoeffs();
        mesh.nonOrthDeltaCoeffs(); mesh.nonOrthCorrectionVectors();
    }

    int main(int argc, char *argv[])
    {
        #include "setRootCase.H"
        #include "createTime.H"

        dco::a1s::global_tape = dco::a1s::tape_t::create();
        #include "createMesh.H"
        #include "createFields.H"

        init_mesh(mesh);
        dco::a1s::global_tape->switch_to_passive();
        [...] // checkpointed SIMPLE algorithm
    }

Listing 4.2: Forcing the early on demand construction of the fvMesh fields by calling their access 
routines. 


Most of these functions are first accessed during the pre-processor phase, and thus the construction 
of the fields is captured by the tape. However, the non-orthogonal correction vectors are first 
constructed when discretizing the gradient operator in the momentum equations, using the 
corrected surface-normal gradient scheme. The first occurrence of this discretization is in the first 
SIMPLE iteration, at which point the tape has already been switched off in order to calculate the 
passive iterations needed for the checkpointing iteration. When the nonOrthCorrectionVectors() access 
function is subsequently called while the tape is active, only a reference to the field created earlier 
is returned. Therefore the dependence of the correction vectors on the parameters is lost. 





To fix this problem, we explicitly call all on demand generator functions of the fvMesh instance 
after the pre-processing is finished but before the tape is switched off. This might be redundant 
for some functions, if the field has already been initialized. However, as in that case only a 
reference is returned, which is subsequently ignored, the run time and memory cost of those 
additional calls is negligible. The actual constructors generating the data are private to the 
fvMesh class, and would require modifications inside the OpenFOAM code base in order to be 
accessible from our solvers. Therefore we simply trigger dummy calls to the accessor routines, 
which have the side effect of creating the required data fields. The changes required in order to 
obtain a consistent checkpointed shape adjoint are presented in Listing 4.2. 
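The mechanism behind the lost dependency can be reproduced in a toy model that uses neither OpenFOAM nor dco/c++. In the sketch below, `ToyTape`, `ToyMesh` and `recordedOps` are hypothetical names; the "tape" only counts recorded operations, and the mesh owns one field that is built on first access.

```cpp
#include <cassert>
#include <memory>

// Toy stand-in for an AD tape: it only counts recorded operations.
struct ToyTape {
    bool active = true;
    int recorded = 0;
    double record(double value) { if (active) ++recorded; return value; }
};

// Toy mesh with one demand-driven field, built on first access only.
class ToyMesh {
    double pointX_;
    ToyTape& tape_;
    mutable std::unique_ptr<double> corrPtr_;
public:
    ToyMesh(double pointX, ToyTape& tape) : pointX_(pointX), tape_(tape) {}
    const double& correction() const {
        if (!corrPtr_)  // construct on demand; recorded only if tape is active
            corrPtr_.reset(new double(tape_.record(2.0 * pointX_)));
        return *corrPtr_;
    }
};

// Number of operations that end up on the tape for the two orderings.
int recordedOps(bool forceEarlyConstruction)
{
    ToyTape tape;
    ToyMesh mesh(1.5, tape);
    if (forceEarlyConstruction) mesh.correction();  // dummy call, result ignored
    tape.active = false;  // passive iterations
    mesh.correction();    // a first access here goes unrecorded
    tape.active = true;   // taped step only gets a reference to the old field
    mesh.correction();
    return tape.recorded;
}
```

Without the early dummy call the construction happens while the toy tape is passive and nothing is recorded; with it, the construction is recorded exactly once and later calls add no cost, which is the argument made above for the redundant accessor calls.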














The same fixes apply when using the checkpointing interface to implement reverse accumulation 
or piggybacking. Figure 4.20 shows sensitivity results over iteration count for a single point on 
an airfoil surface (compare to Section 4.4). Sensitivities are obtained by tangent mode, adjoint 
mode with checkpointing (due to the cost involved, only evaluated every 20 iteration steps), and 
static piggybacking (no design update). All converge to the same solution, with tangent and 
checkpointed results without SDLS being identical within machine precision. The adjoint solution 
with SDLS exhibits a maximum difference from the tangent solution of 0.6%, and the final value 
matches up to 0.002%. 

Figure 4.20: Iteration history of shape sensitivity obtained by tangent mode, adjoint mode, and 
static piggybacking. 


4.4.2 Verification of Shape Adjoints 


Numerical Verification using Higher Order AD 


Using the second order differentiation model, the AD implementation of the adjoints can be 
verified against tangents inside the same solver. As a by-product of the second order derivatives 
generated by the tangent over adjoint model, the model includes both the first order tangent and 
adjoint models: 

\[ x_{(1)} = \nabla f^T \cdot y_{(1)}, \]
\[ y^{(2)} = \nabla f \cdot x^{(2)}. \]


Ignoring the second order derivative components, and using the correct seeding, the tangents and 
adjoints can be calculated within the same solver. To get representative adjoints, the full SIMPLE 
iteration history has to be adjoined back to the inputs after each iteration step. Using the scalar 
tangent mode, only one tangent (i.e. the x, y, or z component of a single point) can be verified 
at a given time. Only a subset of points can thus be verified in a reasonable amount of time. 
We applied this approach and did not observe any discrepancies beyond the usual floating point 
precision differences. Having checked the correctness of the AD model implementation, 
we focus on showing that the results are also consistent with results obtained by the continuous 
adjoint method. 
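The kind of consistency check described here can be illustrated on a small hand-differentiated function. The sketch below does not use dco/c++; the function f(x) = (x0·x1, sin(x0)) and all names are assumptions chosen for the example. It relies on the identity ⟨y_(1), y^(2)⟩ = ⟨x_(1), x^(2)⟩, which holds whenever the tangent and adjoint models use the same Jacobian.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec2 = std::array<double, 2>;

// Example function f(x) = (x0 * x1, sin(x0)); its Jacobian is
// [[x1, x0], [cos(x0), 0]]. Both models below are written by hand.

// Tangent model: y^(2) = grad f * x^(2)
Vec2 tangent(const Vec2& x, const Vec2& xDot)
{
    return {x[1] * xDot[0] + x[0] * xDot[1], std::cos(x[0]) * xDot[0]};
}

// Adjoint model: x_(1) = grad f^T * y_(1)
Vec2 adjoint(const Vec2& x, const Vec2& yBar)
{
    return {x[1] * yBar[0] + std::cos(x[0]) * yBar[1], x[0] * yBar[0]};
}

// Difference between <y_(1), y^(2)> and <x_(1), x^(2)>; zero up to
// round-off if both models differentiate the same function.
double dotProductGap(const Vec2& x, const Vec2& xDot, const Vec2& yBar)
{
    const Vec2 yDot = tangent(x, xDot);
    const Vec2 xBar = adjoint(x, yBar);
    return (yBar[0] * yDot[0] + yBar[1] * yDot[1])
         - (xBar[0] * xDot[0] + xBar[1] * xDot[1]);
}
```

Seeding the tangent with a unit vector and the adjoint with a unit vector compares a single Jacobian entry, which corresponds to verifying one coordinate of one point at a time as described above.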

Figure 4.21: Mesh of a cylinder in laminar flow. Inlet on the left, no-slip wall on the cylinder 
surface, zero pressure outlet on the right, top, and bottom boundaries. The mesh used for the 
calculation has double the resolution. 


Verification Against Continuous Adjoints 


To verify the shape adjoints, laminar steady flows around a cylinder in a channel were studied. 
This is a well studied problem, which exhibits different kinds of behavior for different flow 
conditions. For Reynolds numbers of one and lower, the flow closely resembles a potential flow 
and stays attached to the cylinder. For Reynolds numbers of around 10, a recirculation area 
begins to form behind the cylinder; however, the flow remains steady. For Reynolds numbers of 100 
and above, the flow becomes transient. The cylinder begins to shed vortices with a characteristic 
frequency (characterized by the dimensionless Strouhal number St), due to the oscillating pressure 
field in the wake of the cylinder. This flow phenomenon is commonly known as the von Kármán 
vortex street [Von54]. While still laminar at first, for growing Reynolds numbers the transient 
flow becomes increasingly turbulent. 

To obtain steady solutions, which exhibit different flow characteristics, two cases with Reynolds 
numbers Re = 2 and Re = 20 were studied. A structured block mesh around a cylinder of unit 
diameter was chosen. The structure of the mesh is presented in Figure 4.21, however the actual 
mesh used for the calculations has double the spatial resolution (and thus four times as many 
cells). The flow enters the domain through an inlet on the left, with a prescribed flat velocity 
profile. It exits the domain mainly through the outlet on the right, however also the lower and 
upper boundaries are configured as outlets. The domain is possibly not big enough to eliminate 
all influence of the outflow boundaries onto the flow near the cylinder. This would be required 
to reliably determine the frequency of vortex shedding. However, as we are interested in steady 
cases and want to compare the sensitivities obtained by different approaches of differentiation, a 
minor influence of the boundary conditions on the flow is deemed irrelevant. 





The laminar flow around the cylinder for both cases is illustrated in Figure 4.22. The former case 
exhibits flow around the cylinder with the streamlines of the flow near the cylinder following the 
cylinder surface tangentially. The latter flow is still laminar and steady; however, the streamlines 
detach somewhere after the point of maximum cylinder width and two recirculation areas of 
opposing rotation form in the wake of the cylinder. Further downstream the streamlines converge 
again. 

Figure 4.22: Laminar flow around cylinder for Re = 2 (left) and Re = 20 (right). For the lower 
viscosity, a recirculation area begins to form in the wake of the cylinder. 


Sensitivities are calculated with respect to the power loss between the inlet and outlet, as 
introduced in Section 3.1.3. Similar results can be obtained by choosing the drag on the 
surface of the cylinder as cost function. The discrete results are obtained by applying the 
piggyback method to the simulation, without performing design updates, for 500 iterations. 
Continuous adjoint results are obtained by running a modified version of the stock OpenFOAM 
solver adjointShapeOptimizationFoam, expanded to calculate the wall sensitivities according to 
Equation (2.11) after each iteration step. 

The results are presented in Figures 4.23 and 4.24. The former shows the sensitivities in a 
Cartesian plot, the latter in a polar plot. For both plots the abscissa, ranging from $-\pi$ to $\pi$, is 
the position along the surface of the cylinder, starting from the stagnation point at the leftmost 
position of the cylinder (due to the unit radius it is also the angle between the x-axis and the 
position on the cylinder surface in rad). The ordinate gives the sensitivity of the cost function to 
translation of the cylinder surface points in surface normal direction. A negative value indicates a 
movement in negative normal direction, compressing the cylinder and reducing its volume. 
A positive value expands the cylinder in direction of the normal and thus increases the volume. 

The results of the discrete adjoint $G_D$ have been scaled by a uniform factor $\lambda$ to best match 
the continuous adjoint $G_C$: 

\[ \min_{\lambda} \| G_C - \lambda G_D \|_2 . \]



Due to the linearity of the adjoint momentum equation, the result of the continuous adjoint 
calculation is linearly dependent on the adjoint inlet velocity, which can be arbitrarily chosen. 
Therefore a linear factor between the discrete and continuous solution does not indicate a problem, 
and would be eliminated by the step size control of an optimization scheme. 
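The uniform scale factor in this least-squares fit has a closed form, λ = ⟨G_C, G_D⟩ / ⟨G_D, G_D⟩, obtained by setting the derivative of the squared norm with respect to λ to zero. A minimal sketch follows; the function name `bestScale` and the use of plain vectors are assumptions for illustration, not the post-processing actually used for the figures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Least-squares scale factor: argmin over lambda of ||gC - lambda * gD||_2,
// which has the closed form <gC, gD> / <gD, gD>.
// (Hypothetical helper; gC and gD hold one sensitivity per surface face.)
double bestScale(const std::vector<double>& gC, const std::vector<double>& gD)
{
    assert(gC.size() == gD.size());
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < gD.size(); ++i) {
        num += gC[i] * gD[i];
        den += gD[i] * gD[i];
    }
    assert(den > 0.0);  // gD must not be identically zero
    return num / den;
}
```

If the two sensitivity distributions differ only by a constant factor, this recovers that factor exactly; any remaining mismatch after scaling then reflects a genuine difference in shape between the two curves.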

The results show a very good match between the adjoints produced by the discrete adjoint and 
the continuous adjoint, especially considering that they are produced in a considerably different 
way. While the match between the discrete and continuous adjoint is best judged from the 
Cartesian plot, the influence on the shape can be better recognized in the polar plot. For the low 
Reynolds number case, the results indicate that an optimization would steer toward an ellipsoidal 
shape, in order to lower the surface area of the obstacle presented to the flow. If we assume the 
origin to be the center of the cylinder, the magnitude of the mesh movements is symmetric 
about both the x-axis and the y-axis. This matches the symmetry of the geometry, mesh, and flow field. 

[Plot for Figure 4.23. Abscissa: position on cylinder surface (rad). Ordinate: adjoint shape sensitivity G. Curves: discrete and continuous adjoint for Re = 2 and Re = 20.] 

Figure 4.23: Sensitivities over location (in radians) on the cylinder surface in Cartesian coordinate 
system. Origin is the stagnation point on the front of the cylinder. 


[Plot for Figure 4.24. Curves: discrete and continuous adjoint for Re = 2 and Re = 20; dashed line: zero sensitivity.] 

Figure 4.24: Sensitivities over location (in radians) on the cylinder surface in polar coordinate 
system. 

Figure 4.25: Shape optimization with fixed volume constraint of the unit cylinder. 


For the higher Reynolds number case, the flow field is no longer symmetric about the y-axis, 
and neither are the sensitivities. The sensitivities indicate a shape resembling an hourglass 
(note that the linearized gradient only guarantees an improvement of the cost function for an 
infinitely small perturbation; an actual optimized geometry might look different), likely to prevent 
the separation of the flow from the cylinder and to combat the formation of the recirculation area. 
For this solution, there are regions of positive sensitivity, indicating regions where a redirection of 
the flow is more important than minimizing the surface area of the obstacle. 








4.4.3 Shape Optimization 


After having established the consistency of the discrete shape adjoints, we will now apply the 
gradients obtained to optimize the cylinder geometry. A mesh morphing strategy, developed 
in [Mol18], using adaptations of existing OpenFOAM mesh morphers, will be used. The sensitivities 
are supplied to the morpher to determine the amount of movement of the surface nodes. The 
remaining nodes are moved using a Laplacian smoothing technique [Sor+04], distributing the 
surface movement into the domain, decreasing the movement with increasing boundary distance. 
The connectivity of the mesh remains unchanged during the optimization. The sensitivity results 
obtained earlier suggest, consistent with intuition, reducing the volume of the cylinder in order to 
obtain a lower drag on the cylinder body. Obviously the optimal solution would be to have no 
obstacle to the flow at all. In order to obtain a more meaningful optimization target, we will 
constrain the volume of the (unit) cylinder to its initial volume $\pi h$. To enforce the constraint, we 
choose a simple penalization approach with 

\[ J = J_D(x, q) + \lambda (V_0 - V_i)^2 , \]

where $V_0$ is the sum of all cell volumes in the initial configuration, $V_i$ the sum of cell volumes in 
the deformed state, and $\lambda$ a suitable scalar penalization parameter. Cell volumes are positive by 
definition, so no absolute value is needed in the summation. Due to the fixed boundaries, the 
change of the sum of cell volumes is directly proportional (with inverse sign) to the change in cylinder 
volume. We start with a rather low $\lambda$ and repeatedly increase it during the optimization process 
to enforce the constraint more strongly. The application of the mesh morpher is consistent with a steepest 
descent method, as it moves the surface points by the gradient, scaled by a constant factor. 

Figure 4.26: Normalized drag and penalized drag on the cylinder surface over 200 iteration 
steps. 
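As a short worked step for the penalization approach, differentiating the penalized cost $J = J_D + \lambda (V_0 - V_i)^2$ with respect to a single surface coordinate $q_k$ by the chain rule gives the expression below. This assumes that $V_0$ and $\lambda$ are held fixed during one gradient evaluation; the index $k$ is introduced here only for illustration.

```latex
\[
  \frac{\mathrm{d} J}{\mathrm{d} q_k}
  = \frac{\mathrm{d} J_D}{\mathrm{d} q_k}
  - 2 \lambda \, (V_0 - V_i) \, \frac{\partial V_i}{\partial q_k}
\]
```

The second term vanishes when the volume constraint is met and grows linearly with both the constraint violation and $\lambda$, which is why raising $\lambda$ over the course of the optimization pushes the design back toward the target volume.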


The optimization loop repeatedly calls the (piggyback) solver to obtain a gradient and the 
mesh morpher to translate the sensitivities into an updated mesh. Using the established discrete 
adjoint framework, it is feasible to combine both steps into one application, potentially executing 
a mesh update (with small movements to ensure convergence stability) after each piggybacking 
step. A similar combination of different utilities has been carried out to combine the blockMesh 
and simpleFoam utilities, yielding a solver which allows direct optimization of parameters of 
the blockMesh mesh description. This approach is detailed in Section 4.6. 


The movement of the cylinder surface from the baseline to an optimized configuration is shown 
in Figure 4.25. After 100 iterations of the shape optimization procedure, the drag is reduced 
by 5.8%, the volume constraint is violated by 0.85%. After 200 optimizer iterations, the drag is 
reduced by 6.6%, while the volume of the cylinder is much nearer to the target volume (0.12% 
violation of the volume constraint). 

The gradual reduction in drag is shown in Figure 4.26. Due to the penalization, the constraint 
violation never exceeds 3%. As the gradient $\partial J / \partial G$ nears zero, and the penalization parameter $\lambda$ 
is increased, the gap between $J$ and $J_D$ closes. 





One would suspect that a globally optimal solution would narrow the cylinder even more. 
However, with a fixed mesh topology, for which only the point positions are morphed, the mesh 
quality will degrade for big displacements, leading to distorted meshes and eventually divergence 
of the solvers. Note how the center of gravity of the deformed cylinder is not fixed and moves 
slightly in flow direction. If the cylinder is supposed to stay fixed, the squared movement of the 
center of gravity could be introduced as an additional penalization term. 

Figure 4.27: Baseline mesh and morphed meshes. Mesh after 200 iterations is barely regular 
with highly distorted cells. The geometry should be remeshed at this point. 

The original and morphed meshes are depicted in Figure 4.27. The mesh morpher keeps the 
mesh quality acceptable, such that e.g. no negative volumes occur; however, after 200 iterations 
the quality has significantly degraded and the cylinder should be remeshed. 





4.5 Discrete Adjoint Residual Approach 


In this section we will show the steps required to efficiently implement the discrete adjoint residual 
formulation, introduced in Section 2.10, in OpenFOAM. The large dimension and sparse structure of 
the involved Jacobians motivate the use of coloring techniques. 

First, we efficiently determine the non-zero pattern of the residual Jacobian. Second, we color 
the resulting Jacobian, by applying information obtained either directly from the mesh or from an 
intermediate graph representation. Third, we compute the Jacobian entries using AD or FD and 
use the resulting linear system to compute the desired sensitivities. Last, we present applications 
of this methodology to our reference cases. 








4.5.1 Calculation of Residuals in OpenFOAM 


In the following derivations, we focus on the case of steady laminar flows, and the parameter set 
required for topology optimization. The flow is thus characterized by the velocity and pressure 
fields. To incorporate turbulence or other physical quantities, the states, and therefore also 
the sparse Jacobian of the residuals, can be expanded. The face flux field $\phi$ depends on the 
velocities, but cannot readily be expressed by them with an explicit formula, because it is iteratively 
corrected (see Section 2.3). Therefore, to obtain accurate adjoints, the face flux is introduced as 
an independent variable to the residual Jacobian [RU13]. The state $x$ is assembled from both 
cell centered quantities $(u, p)$ and face centered quantities (face flux $\phi$), yielding the resulting 
state vector $x = (u, p, \phi) \in \mathbb{R}^{4 n_C + n_F}$. Consequently the residual 
$R = (R_u, R_p, R_\phi) \in \mathbb{R}^{4 n_C + n_F}$ is also split between cell centered $(R_u, R_p)$ 
and face centered $(R_\phi)$ entries. 

The residual vector $R_u$ of the momentum equation can be calculated either using the built-in 
residual function of OpenFOAM, 


    fvVectorMatrix UEqn(
        fvm::div(phi, U)
      + turbulence->divDevReff(U)
      + fvm::Sp(alpha, U)
      - fvc::grad(p)
    );
    volVectorField URes = UEqn.residual();


or by explicitly calculating the residual of the linear equation system: 


volVectorField URes = (UEqn & U) + fvc::grad(p); 





Similarly the residual of the mass conservation equation $R_p$ can be calculated either using the 
fvMatrix residual function, 


    fvScalarMatrix pEqn(
        fvm::laplacian(rAtU(), p) == fvc::div(phiHbyA)
    );
    volScalarField pRes = pEqn.residual();


or explicitly as: 
volScalarField pRes = (fvm::laplacian(rAU, p) & p) - fvc::div(phiHbyA) ; 


The update of the face flux is not calculated by solving a linear equation, but by an explicit 
update formula. For a converged case, the difference between two subsequent iterations of $\phi$ can 
be interpreted as the residual $R_\phi$: 

\[ R_\phi = \phi^{i} - \phi^{i-1} . \]

With the update formula of the face flux in OpenFOAM, 
volScalarField rAU(1.0/UEqn.A()); 
volVectorField HbyA(constrainHbyA(rAU*UEqn.H(), U, p)); 


surfaceScalarField phiHbyA("phiHbyA", fvc::flux(HbyA)); 
phi = phiHbyA - fvc::flux(rAU*fvc::grad(p)); 


the calculation of the residual can be implemented in OpenFOAM as: 


surfaceScalarField phiRes = (phiHbyA - fvc::flux(rAU*fvc::grad(p))) - phi; 


4.5.2 Prediction of Jacobian Sparsity Pattern 


In order to efficiently compress the Jacobian of the residuals, as described in Section 2.11, the 
sparsity pattern, that is the individual positions of the non-zero entries, of the Jacobian has to 
be known. A naive way to determine the sparsity pattern is to calculate the Jacobian entries 
densely, and then check which entries are non-zero. The resulting sparsity pattern can then be 
used to recompute the Jacobian more efficiently. This obviously is not very efficient and requires 
that the Jacobian sparsity pattern is reused multiple times to yield any improvement. 

For general purpose computer programs, the sparsity pattern can be obtained more efficiently by, 
instead of calculating the actual entries of the Jacobian, only determining the boolean dependence 
of the outputs on the inputs. For a function $y = f(x)$, with $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$, an output $y_i$ 
depends on an input $x_j$ if the partial derivative $\partial y_i / \partial x_j = J_{ij} \neq 0$, and thus this dependence 
implies a non-zero in the Jacobian. The determination of boolean dependence can be implemented 
as the (forward or reverse) propagation of dependency sets. This process is outlined below. 

Let $f : \mathbb{R}^n \to \mathbb{R}^m$ be implemented by elemental functions $\varphi$ with intermediate variables $v$. We 
assign to each variable $v_i$ a set $D_{v_i} \subset \mathbb{N}$. For the $i$-th input, we initialize $D_{v_i} = \{i\}$. For a unary 
function $v_k = \varphi_k(v_i)$, the dependency is propagated unchanged from $v_i$ to $v_k$: 

\[ D_{v_k} = D_{v_i} \quad \text{for} \quad v_k = \varphi_k(v_i) . \]

For binary functions $v_k = \varphi_k(v_i, v_j)$, the dependency set is determined by the union of the 
dependency sets of its inputs: 

\[ D_{v_k} = D_{v_i} \cup D_{v_j} \quad \text{for} \quad v_k = \varphi_k(v_i, v_j) . \]





Instead of implementing the dependency directly as sets, in dco/c++ a similar approach is chosen 
where the sets are modeled as bitsets of fixed length. This increases the memory footprint of the 
program and may necessitate splitting the dependency calculation into multiple parts (similar to a 
driver for tangent vector mode), but reduces the run time for the individual union operations on 
the dependencies from O(log(n)) to O(1), due to the direct memory lookup of the bitset. 

Second order dependencies (needed for the assembly of Hessians) can be obtained by similar, 
but computationally more expensive methods [Var11]. 
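The propagation rules for unary and binary elemental functions can be sketched with `std::bitset` as the fixed-length set representation. This is only an illustration of the idea and not the dco/c++ pattern type; the names `Pattern`, `makeInput`, `unary`, `binary`, and the bitset length `kMaxInputs` are assumptions.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

constexpr std::size_t kMaxInputs = 64;  // fixed bitset length (assumption)
using DepSet = std::bitset<kMaxInputs>;

// A value that carries only its dependency set, no numeric value.
struct Pattern {
    DepSet deps;
};

// The i-th input depends only on itself: D = {i}.
Pattern makeInput(std::size_t i)
{
    assert(i < kMaxInputs);
    Pattern p;
    p.deps.set(i);
    return p;
}

// Unary elemental function: dependency set is passed on unchanged.
Pattern unary(const Pattern& a) { return a; }

// Binary elemental function: union of both dependency sets (bitwise OR).
Pattern binary(const Pattern& a, const Pattern& b)
{
    return Pattern{a.deps | b.deps};
}
```

For y0 = sin(x0)·x1 one would evaluate `binary(unary(x0), x1)` and find bits 0 and 1 set, so row 0 of the Jacobian has non-zeros in columns 0 and 1. Because only boolean dependence is tracked, the result is a conservative pattern: an expression such as x − x is still reported as depending on x. With more inputs than `kMaxInputs`, the propagation has to be repeated for successive blocks of inputs, which is the splitting mentioned above.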

For OpenFOAM meshes, the determination of the full Jacobian sparsity pattern, using the 
dco/c++ pattern data type, consumes a significant amount of time. However, if the dimensions 
of the finite volume stencils are known, the sparsity pattern can be exactly constructed from 
the mesh connectivity information. A more efficient way to determine the full sparsity pattern 
is thus to use the pattern type to obtain the stencil size of an arbitrary cell inside the domain. 
The full sparsity pattern can then be constructed from this reference stencil and the mesh 
connectivity information. 
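As an illustration of building a pattern from connectivity alone, the sketch below assumes the simplest possible cell-to-cell stencil, in which a cell couples to itself and to the cells sharing a face with it. The stencils actually required for the residual blocks may be wider, so this is a stand-in rather than the construction used in this work. The face addressing follows the owner/neighbour convention for internal faces; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Sparsity pattern of a cell-to-cell Jacobian block for a compact stencil:
// each cell couples to itself and to the cells sharing a face with it.
// owner[f] / neighbour[f] are the two cells adjacent to internal face f.
// (Hypothetical helper; wider stencils need additional neighbour layers.)
std::vector<std::set<int>> cellStencilPattern(int nCells,
                                              const std::vector<int>& owner,
                                              const std::vector<int>& neighbour)
{
    assert(owner.size() == neighbour.size());
    std::vector<std::set<int>> rows(nCells);
    for (int c = 0; c < nCells; ++c) rows[c].insert(c);  // diagonal entry
    for (std::size_t f = 0; f < owner.size(); ++f) {
        rows[owner[f]].insert(neighbour[f]);
        rows[neighbour[f]].insert(owner[f]);
    }
    return rows;
}
```

For three cells in a row (two internal faces) this yields the tridiagonal pattern {0,1}, {0,1,2}, {1,2}. The cost is linear in the number of faces, with no taping or propagation of dependency sets through the solver.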

The stencil sizes required for the calculation of the desired Jacobian are shown in Table 4.2. As 
can be seen, the stencil is rather compact, with the biggest stencils required for the calculation 
of $R_p$ and $R_\phi$. The definitions for cell and face centered stencils are illustrated in Figure 4.28. 

The resulting block structure of the Jacobian of the state residuals is shown in Figure 4.29. An 
exemplary sparsity pattern, exhibiting these blocks, for the angled duct case with 325 cells, is 
presented in Figure 4.30. 

Table 4.2: Stencil size for the individual Jacobian residual blocks. Superscript * indicates a cell 
centered cell stencil, + a cell centered face stencil, † a face centered cell stencil, and ∘ a face 
centered face stencil. 

(a) Cell centered stencils (*, +)   (b) Face centered face stencil (∘)   (c) Face centered cell stencil (†) 

Figure 4.28: Face stencil around central cell (a), face stencil around central face (b), cell stencil 
around central face (c). 

Figure 4.29: Sub blocks of the residual Jacobian. 

Figure 4.30: Jacobian of residual of the reference case with $n_C = 325$, $n_F = 600$, resulting in a 
$\mathbb{R}^{1575 \times 1575}$ sparse matrix with $n_{nz} = 50350$ non-zero entries. Sub-blocks of the Jacobian are 
indicated with different shades of gray background. 


4.5.3 Calculation of Discrete Adjoint Residual 


The discrete adjoint residual approach was implemented using the adjoint mode of AD, which allows 
the Jacobian of the residuals to be determined. For validation and performance comparison, it was 
also implemented in tangent mode and with FD. Colorings of the Jacobian are obtained by using 
graph coloring algorithms implemented in ColPack [Geb+13]. When coloring and compressing 
the Jacobian with ColPack (using the smallest-last heuristic as ordering) we observe slightly fewer 
colors when compressing columns than rows, giving the tangent mode and FD a slight advantage. 
In order to evaluate the matrix vector product $(\partial R / \partial \alpha)^T \cdot \lambda_x$ from Equation (2.19) in tangent 
mode and FD, the matrix $(\partial R / \partial \alpha)^T$ needs to be explicitly calculated. To efficiently calculate 
this matrix, a separate coloring for the parameters $\alpha$ is performed, yielding significantly fewer colors 
than needed to compress the full residual Jacobian, due to the stencil size of the parameter being 
limited to one. The calculation of the residual Jacobian is fastest in (one sided) FD mode, followed 
closely by the adjoint mode, which performs well due to the high amount of re-interpretation 
(one seed and tape interpretation for each color but no additional tape recording). Tangent 
mode exhibits a constant run time overhead compared to FD. Utilizing tangent vector mode, 
the performance should be competitive with FD. For simplicity of the driver, this was not pursued 
further. The adjoint mode can be utilized in a vector mode as well, allowing reverse propagation of 
different seeds at the same time. This can potentially be used to further speed up the calculation 
of the Jacobian, at the cost of higher memory usage. 
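The principle of compressing a sparse Jacobian by column coloring can be sketched in a few lines. The code below is an illustration under simplifying assumptions: it uses a plain greedy coloring in natural column order instead of ColPack's ordered distance-two coloring, one-sided finite differences instead of AD, and hypothetical names (`colorColumns`, `compressedJacobianFD`). Columns that never share a row receive the same color and are perturbed together, so one function evaluation per color suffices.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Pattern = std::vector<std::vector<int>>;  // column indices per row

// Greedy column coloring: columns sharing a row get different colors.
std::vector<int> colorColumns(const Pattern& rows, int nCols)
{
    std::vector<int> color(nCols, -1);
    for (int j = 0; j < nCols; ++j) {
        std::vector<bool> used(nCols, false);
        for (const auto& row : rows) {
            bool hasJ = false;
            for (int k : row) hasJ = hasJ || (k == j);
            if (!hasJ) continue;
            for (int k : row)
                if (color[k] >= 0) used[color[k]] = true;
        }
        int c = 0;
        while (used[c]) ++c;  // smallest color not used by a conflicting column
        color[j] = c;
    }
    return color;
}

// One forward difference per color instead of one per column.
// Returns Jacobian values aligned with the entries of `rows`.
std::vector<Vec> compressedJacobianFD(const std::function<Vec(const Vec&)>& f,
                                      const Vec& x, const Pattern& rows,
                                      double h = 1e-7)
{
    const int nCols = static_cast<int>(x.size());
    const std::vector<int> color = colorColumns(rows, nCols);
    int nColors = 0;
    for (int c : color) if (c + 1 > nColors) nColors = c + 1;

    const Vec f0 = f(x);
    assert(f0.size() == rows.size());
    std::vector<Vec> values(rows.size());
    for (std::size_t i = 0; i < rows.size(); ++i)
        values[i].assign(rows[i].size(), 0.0);

    for (int c = 0; c < nColors; ++c) {
        Vec xp = x;
        for (int j = 0; j < nCols; ++j)
            if (color[j] == c) xp[j] += h;  // seed all columns of this color
        const Vec fp = f(xp);
        for (std::size_t i = 0; i < rows.size(); ++i)
            for (std::size_t k = 0; k < rows[i].size(); ++k)
                if (color[rows[i][k]] == c) values[i][k] = (fp[i] - f0[i]) / h;
    }
    return values;
}
```

For a tridiagonal pattern the greedy coloring needs three colors regardless of the matrix dimension, so the full Jacobian is recovered from three perturbed evaluations. The same seeding idea carries over to tangent mode (seed vectors instead of perturbations) and, with row coloring, to adjoint mode.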

In preparation for the solution of the linear system $(\partial R / \partial x)^T \cdot \lambda_x = (\partial J / \partial x)^T$, the coefficients 
of the Jacobian are stored in an Eigen [GJ+10] sparse matrix. Storage in a native OpenFOAM 
format would be preferable, particularly to preserve parallelism during the solution; however, 
OpenFOAM lacks convenient general purpose linear equation solvers for fields which are not 
connected to a specific geometry. The solution of the linear equation system is calculated with 
either the Eigen SparseLU or Eigen BiCGStab (with incomplete LU preconditioner) solvers. 
In our implementation, the overall run time is heavily dominated by the solution of the linear 
system, making the overhead of the AD tool less significant. The memory requirement to reverse 
the residual evaluation by AD is considerably lower than to tape a full SIMPLE iteration step. 
The memory required for the tape is of the same order of magnitude as the space needed to 
store the full sparse Jacobian matrix. For reference, to reverse one full iteration of the finer 
case, introduced below, about 610 MB of tape space (with SDLS enabled) are needed, while the 
tape size required to capture the calculation of the FVM residual is 408 MB. The total memory 
consumption including the storage and solution of the sparse matrix is 1250 MB. 

Therefore, despite requiring multiple evaluations of the tape to obtain the full Jacobian, the 
adjoint method is competitive with FD, in both run time and memory consumption. 
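For orientation, the two steps of the discrete adjoint residual approach can be written out from the implicit function theorem. The following is the standard derivation for a converged state with $R(x(\alpha), \alpha) = 0$; the sign and transposition conventions are an assumption here and may differ from those of Equation (2.19).

```latex
\[
  \left(\frac{\partial R}{\partial x}\right)^{T} \lambda_x
  = \left(\frac{\partial J}{\partial x}\right)^{T},
  \qquad
  \left(\frac{\mathrm{d} J}{\mathrm{d} \alpha}\right)^{T}
  = \left(\frac{\partial J}{\partial \alpha}\right)^{T}
  - \left(\frac{\partial R}{\partial \alpha}\right)^{T} \lambda_x .
\]
```

The first equation is the sparse linear system whose solution dominates the run time; the second is a single sparse matrix vector product, which is why a separate, cheaper coloring for $\partial R / \partial \alpha$ is sufficient.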

To illustrate the results of the discrete adjoint residual solver, we once again turn to the angled 
duct example. Because this case is a laminar 2D case, the state vector consists of $x = (u_x, u_y, p, \phi)$ 
and consequently the size of the residual Jacobian is $(3 n_C + n_F) \times (3 n_C + n_F)$. 

Figure 4.31: Solutions for the fine angled duct case obtained by the discrete adjoint residual.

We investigate two different mesh resolutions, a coarse mesh with $n_C = 2925$ cells and
$n_F = 5700$ internal faces and a fine mesh with $n_C = 46800$ cells and $n_F = 93000$ internal
faces. A ColPack bipartite graph representation is built from the sparsity pattern, then partial
row/column distance two coloring is applied to obtain a suitable Jacobian coloring. Coloring the
columns of the coarse Jacobian using ColPack yields 90 colors, coloring the rows 102. The fine
Jacobian yields slightly more colors, namely 95 for coloring the columns and 106 for coloring the
rows. As the general connectivity of the mesh is not changed by the mesh refinement, the lower
limit for the number of colors likely does not increase with finer meshes; however, the coloring
heuristic performs slightly worse for the finer mesh.
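The benefit of a column coloring can be illustrated on a small example. The following is a minimal sketch, assuming a hypothetical residual with a tridiagonal Jacobian (not the FVM residual used here) and forward finite differences instead of AD. The names `residual` and `compressedJacobian` are illustrative only. Columns of equal color never share a row, so they can be perturbed together and the full Jacobian is recovered from three perturbed evaluations, independent of the problem size.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical residual with a tridiagonal Jacobian:
// r_i = 2 x_i^2 - x_{i-1} - 3 x_{i+1} (out-of-range neighbors are dropped).
Vec residual(const Vec& x) {
    const int n = static_cast<int>(x.size());
    Vec r(n);
    for (int i = 0; i < n; ++i) {
        r[i] = 2.0 * x[i] * x[i];
        if (i > 0)     r[i] -= x[i - 1];
        if (i < n - 1) r[i] -= 3.0 * x[i + 1];
    }
    return r;
}

// Recover the full Jacobian from one base and three perturbed evaluations.
// Columns j with the same color (j % 3) never share a row, so they are
// perturbed together (forward differences with step h).
std::vector<Vec> compressedJacobian(const Vec& x, double h = 1e-7) {
    const int n = static_cast<int>(x.size());
    const Vec r0 = residual(x);
    std::vector<Vec> J(n, Vec(n, 0.0));
    for (int color = 0; color < 3; ++color) {
        Vec xp = x;
        for (int j = color; j < n; j += 3) xp[j] += h;  // seed vector of this color
        const Vec rp = residual(xp);
        for (int j = color; j < n; j += 3)              // decompress into column j
            for (int i = std::max(0, j - 1); i <= std::min(n - 1, j + 1); ++i)
                J[i][j] = (rp[i] - r0[i]) / h;
    }
    return J;
}
```

Calling `compressedJacobian` on a vector of length 7 fills all 19 non-zero entries with four residual evaluations, whereas an uncompressed finite difference Jacobian would need eight.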


Figure 4.31 shows the sensitivity results of the angled duct case, as obtained by the discrete
adjoint residual method. The first row of figures shows the adjoint velocity and pressure, which
can be extracted from the solution $\lambda_x = (\bar{u}, \bar{p}, \bar{\phi})$ and have the same
physical meaning as the adjoint velocities and pressures defined for the continuous adjoints. The
second row shows the final sensitivities $dJ/d\alpha$, as well as the sign of the sensitivity for
easier cross reference.





The sensitivity results are identical for tangent and adjoint mode (up to machine precision) 
and align with the FD results very well. The results also match the results obtained by both the 
discrete black-box differentiation and the continuous adjoint presented in Section 4.3.3. 


Note that the adjoint velocities u obtained by this method are vector quantities. They can be
used to evaluate Equation (2.11) to obtain shape derivatives, equivalent to the continuous adjoint 
approach, circumventing the need to differentiate through the generation of the mesh from the 
individual points. 


Table 4.3 lists the run times of the following phases for adjoint, tangent, and FD mode:

Passive evaluation: Iteration of the case from initial condition to a converged state.

Assembly of sparsity pattern: Assembly of the expected sparsity pattern of the Jacobian from
mesh connectivity.

Coloring: Conversion of sparsity pattern to ColPack graph format and partial coloring of the
bipartite graph.

Differentiation: Calculation of the Jacobians $J_x$ and $J_\alpha$ ($J_\alpha$ only required for
tangent mode and FD).

Solution: Solution of the linear system and calculation of (2.18).


Table 4.3: Run times of the different stages and memory consumption for adjoint, tangent, and
FD solver for the coarse and fine level angled duct. Passive run time is calculated over 200
iteration steps.

Coarse Case       A1S          T1S          FD
Colors            102          90 / 23      90 / 23
Passive           6.63 s       4.93 s       2.26 s
Pattern           0.58 s       0.60 s       0.63 s
Diff              0.74 s       1.16 s       0.34 s
Solve             2.81 s       3.15 s       2.48 s
Total run time    10.90 s      10.44 s      6.17 s
Max memory        176.39 MB    155.54 MB    131.35 MB

Fine Case         A1S          T1S          FD
Colors            106          95 / 25      95 / 25
Passive           122.88 s     100.00 s     40.52 s
Pattern           9.94 s       10.24 s      11.47 s
Diff              14.99 s      20.66 s      5.68 s
Solve             147.40 s     166.42 s     152.85 s
Total run time    303.11 s     307.00 s     217.00 s
Max memory        1254.34 MB   871.55 MB    805.51 MB


All stages are performed in one solver. In practice it can be useful to separate the stages, as
the sparsity pattern and coloring remain constant for a specific mesh and are independent of
e.g. boundary conditions. They can thus be reused for different configurations of the same case.
The iteration procedure can be started from a partially or fully converged state (which can be
created by a fully passive version of OpenFOAM) instead of the initial state, lowering the time
for passive evaluation.


4.5.4 Directly Obtaining Colors from the Mesh Representation


In order to reduce the computation time of the Jacobian, a coloring approach, as presented in
Section 2.11, is used to compress rows or columns of the Jacobian. Previously we used the external
software package ColPack to obtain a suitable coloring. This necessitated the calculation of the
non-zero pattern, as well as the construction of a graph structure from the Jacobian non-zero
pattern. We will now introduce a method to directly obtain a feasible coloring from the FVM
mesh representation.


Mesh connectivity graph 


To analyze coloring problems arising from the discretization of CFD problems, one would like
to utilize already well known results from graph theory. Therefore, it is desirable to introduce
a graph representation of the mesh connectivity, as some properties defined on graphs can be
reused on the mesh description.





Definition 19 (Mesh connectivity graph).
Let $M = (L, U)$ be a mesh addressing given in LDU format, with $L, U \in \mathbb{N}^{n_F}$. We
define the corresponding mesh connectivity graph $G_M = (V_M, E_M)$, consisting of:

• One node for each cell in the mesh: $V_M = \{ v_i \mid i = 0, \ldots, n_C - 1 \}$;

• Each edge in the graph corresponds to an interior face connecting two cells, that is a pair
$(l_i, u_i) \in (L, U)$: $E_M = \{ (c_{l_i}, c_{u_i}) \mid i = 0, \ldots, n_F - 1 \}$.


A cell is directly adjacent to another cell if it shares a face in the mesh. This corresponds to an
edge in the mesh connectivity graph. The distance between two cells $(c_i, c_j)$ can be conveniently
defined as the length of the shortest path in the mesh connectivity graph connecting both cells.





Definition 20.

Let $v_i$ and $v_j$ be cells of a finite volume mesh, that is they are vertices in $G_M$. We define
the distance $d = D(v_i, v_j) \in \mathbb{N}$ between cells $v_i$ and $v_j$ as the length of the
shortest path in $G_M$ connecting the nodes corresponding to both cells.


Figure 4.32 gives an example 3 × 3 mesh with 9 cells and 12 internal faces, its corresponding
mesh connectivity graph, and its internal LDU representation.
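The two definitions translate directly into code. The sketch below builds the mesh connectivity graph from LDU addressing and evaluates the cell distance by breadth-first search. The arrays `lower3x3` and `upper3x3` are a hypothetical LDU addressing of a 3 × 3 mesh with cells numbered row by row; the numbering used in Figure 4.32 may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Build adjacency lists of the mesh connectivity graph from LDU addressing:
// internal face f connects the cells lower[f] and upper[f].
std::vector<std::vector<int>> buildAdjacency(int nCells,
                                             const std::vector<int>& lower,
                                             const std::vector<int>& upper) {
    std::vector<std::vector<int>> adj(nCells);
    for (std::size_t f = 0; f < lower.size(); ++f) {
        adj[lower[f]].push_back(upper[f]);
        adj[upper[f]].push_back(lower[f]);
    }
    return adj;
}

// Distance D(v_i, v_j): length of the shortest path, found by BFS.
// Returns -1 if the two cells are not connected.
int cellDistance(const std::vector<std::vector<int>>& adj, int from, int to) {
    std::vector<int> dist(adj.size(), -1);
    std::queue<int> q;
    dist[from] = 0;
    q.push(from);
    while (!q.empty()) {
        const int c = q.front();
        q.pop();
        if (c == to) return dist[c];
        for (int nb : adj[c])
            if (dist[nb] < 0) { dist[nb] = dist[c] + 1; q.push(nb); }
    }
    return -1;
}

// Hypothetical LDU addressing of a 3 x 3 mesh (9 cells, 12 internal faces).
const std::vector<int> lower3x3 = {0, 0, 1, 1, 2, 3, 3, 4, 4, 5, 6, 7};
const std::vector<int> upper3x3 = {1, 3, 2, 4, 5, 4, 6, 5, 7, 8, 7, 8};
```

With this numbering, the center cell 4 has four neighbors and the two opposite corner cells 0 and 8 are at distance four.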


Application of Mesh Connectivity to CFD problems 


Definition 21 (Cell Stencil).

In CFD the stencil of discretization is defined by the cell neighborhood around a cell which
influences the value of the solution at this specific cell in the next iteration step. The size of this
stencil is defined as the maximum distance, in the cell adjacency graph, between the center cell
and any cell of the stencil.


An example of two non-overlapping stencils of size two on a structured mesh is shown in
Figure 4.33. The spreading of information over consecutive iteration steps, as well as the
distance 2d neighborhood needed for the later proof, is illustrated in Figure 4.34.

We will show that a distance 2d coloring on the cell adjacency graph corresponds to a partial
distance two coloring of the bipartite graph of the Jacobian. Thus it can be used to compress the
Jacobian matrix of the residuals. The same can easily be shown for face stencils.

Figure 4.32: Mesh connectivity graph, derived from lduAddressing corresponding to 3 × 3
mesh.

Figure 4.33: Non-overlapping stencils of size two around two cells located at distance four.

Figure 4.34: Mesh connectivity graph of 1D mesh. Nodes 2 and 7 are connected with dashed
edges to all nodes reachable by distance 2d = 4.

Figure 4.35: Bipartite graph for 1D structured mesh with $n_C = 10$ and finite volume stencil of
size two. Nodes $r_2$ and $r_7$ (or $c_2$ and $c_7$ for column compression) can share the same color,
as no path of length two exists between them. Blue edges indicate matrix entries created by
stencils of size one, red edges additional matrix entries created by stencil of size two.








Figure 4.36: Distance-0, distance-1, and distance-2 stencils in three dimensions. Stencils
exploded along depth axis.


Theorem 8.

A distance 2d coloring, where d is the maximum stencil size used in the finite volume discretization,
on the mesh connectivity graph $G_M = (V_M, E_M)$, can be used to compress the Jacobian of the
residual of an equation discretized by this stencil.


Proof. We will transform the coloring problem on $G_M$ into an equivalent problem on the bipartite
adjacency graph, which can be colored according to Lemma 1. Let two cells $v_i, v_j$ be at distance
d or less in the mesh connectivity graph, that is, starting from cell i, cell j can be reached by
crossing at most d faces. Then the undirected edges $(r_i, c_j) \in E_B$ and $(r_j, c_i) \in E_B$ are
both part of the bipartite graph $G_B = (V_B, E_B)$ of the Jacobian induced by the finite volume
stencils (as $v_j$ is inside the stencil of $v_i$ and vice versa).

Let two cells $v_{i'}, v_{j'}$ be at distance of 2d or less. Then there exists a path of length two
in the bipartite graph, $r_{i'} \to c_k \to r_{j'}$, with $D(v_{i'}, v_k) \le d$ and
$D(v_k, v_{j'}) \le d$. Thus, cells $v_{i'}$ and $v_{j'}$ may not share a common color for a direct
recoverable row compression (Lemma 1). Due to symmetry, the same argument holds for column
compression and the path $c_{i'} \to r_k \to c_{j'}$.

Now let the two cells $v_{i'}, v_{j'}$ be at distance of at least 2d + 1. There exists no path
$r_{i'} \to c_k \to r_{j'}$ in the bipartite graph, as for each k either $D(v_{i'}, v_k) > d$ or
$D(v_k, v_{j'}) > d$, or both. Thus, cells $v_{i'}$ and $v_{j'}$ can share a color without breaking
the distance two condition on the bipartite graph.

Therefore, a mesh connectivity graph, colored such that no nodes at distance 2d or lower are
colored with the same color, corresponds to a bipartite graph with a valid (partial) distance two
coloring. Such a coloring can be used to compress the Jacobian calculation (Lemma 1). □
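The statement of Theorem 8 can be checked by brute force on the 1D mesh of Figure 4.35. The following sketch assumes a 1D mesh in which cell k lies in the stencil of cell i exactly if |i − k| ≤ d; all function names are illustrative. It verifies that two rows of the Jacobian share a column exactly if the cells are at distance 2d or less, and that the simple coloring `cell % (2d + 1)` therefore never assigns one color to two conflicting rows.

```cpp
#include <cassert>
#include <cstdlib>

// Column k is non-zero in row i of the Jacobian if cell k lies inside the
// stencil of cell i, i.e. |i - k| <= d on a 1D mesh.
bool rowsShareColumn(int i, int j, int d, int nCells) {
    for (int k = 0; k < nCells; ++k)
        if (std::abs(i - k) <= d && std::abs(j - k) <= d) return true;
    return false;
}

// Brute-force check: two rows share a column (and thus must not share a
// color) exactly if the corresponding cells are at distance <= 2d.
bool theoremHolds1D(int nCells, int d) {
    for (int i = 0; i < nCells; ++i)
        for (int j = i + 1; j < nCells; ++j)
            if (rowsShareColumn(i, j, d, nCells) != (j - i <= 2 * d))
                return false;
    return true;
}

// A distance-2d coloring of the 1D mesh: cells of equal color are at least
// 2d + 1 apart.
int color1D(int cell, int d) { return cell % (2 * d + 1); }
```

For ten cells and d = 2, cells 2 and 7 receive the same color and their rows share no column, in agreement with Figure 4.35, while cells 2 and 6 both depend on cell 4 and need different colors.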


With the mesh connectivity graph, a valid coloring can be obtained directly on the mesh 
description (see Section 4.5.2). The OpenFOAM LDU mesh description implicitly models the 
mesh connectivity graph. The following four ordering heuristics, which are ordered from least to 
most computationally expensive, have been implemented directly in an OpenFOAM solver. 


Natural Ordering: Color cells in order of their cell numbering. 
Random Ordering: Color cells according to a random permutation of their cell numbering. 


Approximate Largest First: Compute the size of the distance-1 neighborhood for each cell and
sort big to small. Color in this order. The distance-1 neighborhood is already available in
the OpenFOAM mesh representation, the evaluation of the neighborhood size is thus cheap.


True Largest First: Compute the size of the full distance-4 neighborhood for each cell, and sort 
big to small. Color in this order. 
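All four heuristics share the same greedy coloring step and differ only in the order in which the cells are visited. The following is a minimal sketch of that step on a structured 2D mesh, not the OpenFOAM implementation: `structuredMesh2D`, `neighborhood`, `greedyColoring` and `isValidColoring` are hypothetical helper names, and the per-call allocation of the distance array is chosen for brevity rather than speed.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <queue>
#include <vector>

using Graph = std::vector<std::vector<int>>;

// Mesh connectivity graph of a structured n x n mesh (cells numbered row by row).
Graph structuredMesh2D(int n) {
    Graph adj(n * n);
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            const int cell = r * n + c;
            if (c + 1 < n) { adj[cell].push_back(cell + 1); adj[cell + 1].push_back(cell); }
            if (r + 1 < n) { adj[cell].push_back(cell + n); adj[cell + n].push_back(cell); }
        }
    return adj;
}

// All cells within distance k of root (BFS stopped after k levels).
std::vector<int> neighborhood(const Graph& adj, int root, int k) {
    std::vector<int> dist(adj.size(), -1), result;
    std::queue<int> q;
    dist[root] = 0;
    q.push(root);
    while (!q.empty()) {
        const int c = q.front();
        q.pop();
        result.push_back(c);
        if (dist[c] == k) continue;  // do not expand beyond level k
        for (int nb : adj[c])
            if (dist[nb] < 0) { dist[nb] = dist[c] + 1; q.push(nb); }
    }
    return result;
}

// Greedy distance-k coloring in the given cell order: each cell receives the
// smallest color not yet used inside its distance-k neighborhood.
std::vector<int> greedyColoring(const Graph& adj, const std::vector<int>& order, int k) {
    std::vector<int> color(adj.size(), -1);
    for (int cell : order) {
        std::vector<bool> used;
        for (int nb : neighborhood(adj, cell, k)) {
            if (color[nb] < 0) continue;  // uncolored (includes the cell itself)
            if (color[nb] >= static_cast<int>(used.size())) used.resize(color[nb] + 1, false);
            used[color[nb]] = true;
        }
        int c = 0;
        while (c < static_cast<int>(used.size()) && used[c]) ++c;
        color[cell] = c;
    }
    return color;
}

// Validity check: no two distinct cells within distance k share a color.
bool isValidColoring(const Graph& adj, const std::vector<int>& color, int k) {
    for (int cell = 0; cell < static_cast<int>(adj.size()); ++cell)
        for (int nb : neighborhood(adj, cell, k))
            if (nb != cell && color[nb] == color[cell]) return false;
    return true;
}
```

Natural ordering corresponds to passing the identity permutation as `order` (for example filled with `std::iota`); the other heuristics would only change how `order` is constructed before the same greedy step is run.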


First we investigate the coloring performance for a structured hexahedral n × n × n mesh of
the unit cube. A distance two stencil (blue), as well as all cells colored with color zero (red) for
the 10 × 10 × 10 unit cube, are shown in Figure 4.39. A graphical representation of structured 3D
stencils up to distance-2 is given in Figure 4.36.

All above heuristics exhibit a run time behavior linear in the number of cells $n_C = n^3$ of the
mesh. This is to be expected, as each cell is individually colored, and the cost per cell is constant,
as is argued below. For every cell, a breadth-first search (BFS) [Moo59] is performed; however,
the BFS is stopped after a fixed number of steps (the desired coloring distance). Thus the run
time of a single BFS is independent of $n_C$. Instead it depends on the connectivity of the mesh
and the type of the used cells. The linear run time behavior can be seen in Figure 4.37, where
the resolution of the unit cube is scaled from $N = 10^3$ to $N = 100^3$. The corresponding number
of colors required to color the Jacobian is shown in Figure 4.38. The number of colors required
remains largely constant as $n_C$ rises.

Figure 4.37: Run time behavior of the presented heuristics. Run time scales linearly with $n_C$.

Figure 4.38: Number of colors required to color the unit cube for the presented heuristics.
Number of cells $n_C$ dashed on second axis.
(Plot: number of colors and number of cells $n_C = n^3$ over cube edge resolution n, for Natural
Ordering, Random Ordering, Approx. Largest First Ordering and Largest First Ordering.)

Figure 4.39: Unit cube meshed with 10 × 10 × 10 cells, colored with 36 colors. Only cells of
the first color (red), as well as one stencil (blue), are shown.


For unstructured meshes, the true largest first heuristic performs very well. For structured 
meshes, it performs worse, which is explained by the uniform size of the cell neighborhoods, 
yielding many nodes with the same degree. In this case the heuristic basically falls back to a 
natural ordering (natural ordering is used as tie-breaker), except near boundary patches, where 
the number of cell neighbors is lower. 














In the implementation it has been observed that using a flat (vector based) set representation
instead of the red-black tree [Cor+09] based implementation in the C++ standard library [Pla+00] is
considerably more efficient when performing the BFS on the cell neighborhood. An implementation
of the BFS search around a given cell in the mesh representation is given in Listing 4.3. To
improve performance and reduce memory, the recursive BFS algorithm is explicitly unrolled to a
loop based implementation.

The size of the cell neighborhood is determined by the number of neighboring cells. For
structured hexahedral meshes, the size of the neighborhood is determined by

$$
n_n = \begin{cases}
2d + 1 & \text{1D} \\
d^2 + (d + 1)^2 & \text{2D} \\
\frac{1}{3}\left(4d^3 + 6d^2 + 8d + 3\right) & \text{3D,}
\end{cases}
$$

where d is the stencil size. The full distance-4 neighborhood consequently contains 41 cells for 2D
and 129 cells for 3D structured hexahedral meshes.
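The closed-form neighborhood sizes can be cross-checked against a brute-force count. The sketch below assumes an unbounded structured mesh, where the cells reachable by crossing at most d faces are the lattice points with |i| + |j| + |k| ≤ d. The 3D expression used in the code, (4d³ + 6d² + 8d + 3)/3, is the one that reproduces the 129 cells quoted above for d = 4; the function names are illustrative.

```cpp
#include <cassert>
#include <cstdlib>

// Closed-form sizes of the distance-d neighborhood on structured meshes.
long size1D(long d) { return 2 * d + 1; }
long size2D(long d) { return d * d + (d + 1) * (d + 1); }
long size3D(long d) { return (4 * d * d * d + 6 * d * d + 8 * d + 3) / 3; }

// Brute-force count of the lattice points with |i| + |j| + |k| <= d,
// restricted to the first `dim` coordinate directions.
long countCells(int d, int dim) {
    long count = 0;
    const int dj = dim >= 2 ? d : 0;
    const int dk = dim >= 3 ? d : 0;
    for (int i = -d; i <= d; ++i)
        for (int j = -dj; j <= dj; ++j)
            for (int k = -dk; k <= dk; ++k)
                if (std::abs(i) + std::abs(j) + std::abs(k) <= d) ++count;
    return count;
}
```

Near boundary patches the neighborhoods are truncated, so the formulas give an upper bound for cells of a finite structured mesh.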

For unstructured meshes, the size of the cell neighborhood is obviously much more varied. Here 
the largest first heuristic performs well. While this heuristic takes considerably longer to evaluate, 
it produces an ordering that yields significantly fewer colors. This is illustrated by the motorbike 
and VW Polo cases in Table 4.4. The motorbike case consists of 352 863 cells and has a maximum 
cell neighborhood of size 455, due to the polyhedral nature of the mesh. The VW Polo case is hex 
dominated, with tetrahedrons connecting the interior elements to the boundary surfaces. This 
mesh contains 7.75 million cells, the maximum size of the distance-4 cell neighborhood is 595. 

The run time of the coloring heuristics is dominated by the BFS during the coloring stage.
Only for the true largest first coloring heuristic does the creation of the ordering take a
significant amount of time, as it also evaluates the cell neighborhood using BFS. The overall run
time for the true largest first scheme can potentially be improved, at the cost of much higher RAM
usage, by saving the BFS results during the ordering phase and reusing them during coloring.


Table 4.4: Number of required colors and run time in seconds for the different ordering heuristics.

                            Cube 10³            Cube 100³           Pitz-Daily
Heuristic                   Colors  Run time    Colors  Run time    Colors  Run time
Natural Ordering            41      0.03        45      19          19      0.07
Random Ordering             48      0.02        58      27.4        25      0.08
Approximate Largest First   44      0.03        62      18.4        25      0.08
True Largest First          36      0.04        62      32.3        2a      0.13

                            Pitz-Daily 3D       Motorbike           VW Polo
Heuristic                   Colors  Run time    Colors  Run time    Colors  Run time
Natural Ordering            48      2.04        99      13.54       110     307.23
Random Ordering             56      3.18        90      12974       101     384.95
Approximate Largest First   60      2.15        96      14.09       115     315.02
True Largest First          48      3.84        19      25.89       87      622.12

The true largest first distance-4 coloring on the mesh produces results which are competitive
with colorings obtained by ColPack [Geb+13] on the adjacency graph of the Jacobian. For the VW
Polo case, which is the biggest case we considered, ColPack produced 85 colors, compared to
the 87 colors obtained by the coloring implemented directly in OpenFOAM. A visualization of the
cell colors of the near wall cells, mapped onto the surface of the car body, is shown in Figure 4.40.


    #include "fvMesh.H"   // OpenFOAM mesh: fvMesh, label, labelList, forAll

    // set<label> is assumed to be std::set or a flat (vector based) set type
    // offering the same insert() interface.
    set<label> calc_cell_neighborhood_BFS(label root, label k, const fvMesh& mesh){
        set<label> neigh;
        neigh.insert(root);
        set<label> newCells = neigh; // copy root set
        for(int i = 0; i < k; i++){
            set<label> tmpCells = newCells; // iterate over all cells of old level
            newCells.clear(); // empty set for next level
            for(label it : tmpCells){
                // cells sharing a face with cell it
                const labelList& newNeigh = mesh.cellCells(it);
                forAll(newNeigh, j){
                    // returns pair<iterator,bool>, true if element not previously in set
                    auto ret = neigh.insert(newNeigh[j]);
                    // only add new elements for further exploration in next level
                    if(i < k-1 && ret.second)
                        newCells.insert(newNeigh[j]);
                }
            }
        }
        return neigh;
    }

Listing 4.3: BFS algorithm to identify the distance k neighborhood around cell index root.

Figure 4.40: Cell colors of distance-4 coloring of near wall cells. Plot is split at the y = 0 plane.
On the y > 0 half of the plot only cells of color zero are shown in red.


4.6 Parametric Optimization 


4.6.1 Introduction 


In addition to high dimensional optimizations, carried out by topology and shape optimization, 
the discrete AD model can also be applied to a parametric optimization setting. Parametric 
optimizations use a set of parameters, which model the shape of the geometry in some way. For 
example, a pipe could be modeled by a spline, defining the centerline of the pipe, and a set of 
radii, which model the cross section of the pipe. 

Compared to the state vector $x \in \mathbb{R}^{n_x}$, the dimension of the parameter vector
$\gamma \in \mathbb{R}^{n_\gamma}$ is rather low, with $1 \le n_\gamma \ll n_x$. Therefore a
calculation using either tangent mode or adjoint mode is feasible. The parameters are an input to
the mesher, which transforms the parameters to the full mesh representation.

To circumvent the need to differentiate the mesher and solver at the same time, a hybrid
approach is possible. In this setting, the sensitivities of the generated points Q (output of
mesher M) w.r.t. the design parameters, $dQ/d\gamma$, are generated using tangent (vector) mode
at cost $O(n_\gamma) \cdot \mathrm{cost}(M(\gamma))$. The sensitivity of the (scalar) cost function
with respect to all points, $dJ/dQ$, is generated in adjoint mode at cost
$O(1) \cdot \mathrm{cost}(J(F(\gamma)))$. The final sensitivities can then be calculated by the
following product:

$$
\frac{dJ}{d\gamma} = \frac{\partial J}{\partial Q} \cdot \frac{\partial M}{\partial \gamma}.
$$
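The hybrid approach can be sketched on a toy problem. The code below is an illustration under stated assumptions, not the OpenFOAM implementation: `mesher` is a hypothetical map from two parameters to four point coordinates, `cost` is a hypothetical scalar function standing in for the flow solver, and the tangent and adjoint routines are written by hand instead of being generated by an AD tool. One tangent evaluation per parameter gives a column of the mesher Jacobian, one adjoint evaluation gives the full gradient with respect to the points, and the final gradient is their product.

```cpp
#include <array>
#include <cassert>
#include <cmath>

constexpr int nGamma = 2;  // number of design parameters
constexpr int nQ = 4;      // number of generated point coordinates

using Params = std::array<double, nGamma>;
using Points = std::array<double, nQ>;

// Toy "mesher" M: parameters -> point coordinates.
Points mesher(const Params& g) {
    return {g[0], g[0] * g[1], g[1] * g[1], g[0] + 2.0 * g[1]};
}

// Tangent of the mesher: directional derivative (dQ/dgamma) * gdot.
// One call per parameter yields one column of dQ/dgamma.
Points mesherTangent(const Params& g, const Params& gdot) {
    return {gdot[0],
            gdot[0] * g[1] + g[0] * gdot[1],
            2.0 * g[1] * gdot[1],
            gdot[0] + 2.0 * gdot[1]};
}

// Toy scalar cost function J(Q), standing in for the flow solver.
double cost(const Points& q) {
    return q[0] * q[0] + std::sin(q[1]) + q[2] * q[3];
}

// Adjoint of the cost function: full gradient dJ/dQ in a single sweep.
Points costAdjoint(const Points& q) {
    return {2.0 * q[0], std::cos(q[1]), q[3], q[2]};
}

// Hybrid gradient: dJ/dgamma_i = sum_k dJ/dQ_k * dQ_k/dgamma_i.
Params hybridGradient(const Params& g) {
    const Points qbar = costAdjoint(mesher(g));
    Params grad{};
    for (int i = 0; i < nGamma; ++i) {
        Params seed{};
        seed[i] = 1.0;  // unit tangent seed for parameter i
        const Points qdot = mesherTangent(g, seed);
        for (int k = 0; k < nQ; ++k) grad[i] += qbar[k] * qdot[k];
    }
    return grad;
}
```

The expensive part, the adjoint of the cost function, is evaluated once regardless of the number of parameters; only the cheap mesher tangent is repeated per parameter.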
An advantage of parametric designs is that, assuming a reasonably compact parametrization is
chosen, the final design can be straightforwardly modeled in CAD tools by changing the parameters
of the initial design to the final optimized values. The reduced parameter set of parametric
optimization also allows the application of advanced optimization schemes, as parameters can
more easily be constrained and Hessian approximations can be assembled more effectively. The
remeshing after every optimization step ensures that the mesh quality stays consistent, which
for a shape optimization, moving all surface points, is much more challenging and might require
excessive amounts of smoothing of the surface features.

Figure 4.41: Illustration of a B-spline, defined by five control points (red). Spline in solid blue,
corresponding control polygon dashed. For the optimization, only the inner three control
points are treated as parameters.


4.6.2 Parametric BlockMesh Optimizer 


As a proof of concept, we differentiate through a combined solver of the mesher blockMesh and 
flow solver simpleFoam, using adjoint AD. 

The blockMesh mesh description syntax is a plain text description of blocks, faces, and
edges. From this description, a mesh is generated by the blockMesh utility, using hexahedral
elements. The resulting meshes resemble structured meshes (OpenFOAM meshes are always
stored unstructured). Edges can be curved, where the curves are parametrized by splines and
their corresponding control points. Different spline interpretations are available including Bézier
curves, B-splines and Catmull-Rom splines. For this proof of concept, we only considered B-spline
curves [De 78], as they best retain tangential relations in the geometry. The other spline types
can be differentiated analogously. B-splines are defined by a control polygon, that is a piecewise
linear path through the control points. The spline connects the start to the endpoint, but does
not pass through the intermediate control points. Such a spline is illustrated in Figure 4.41. The
function defined by the control points, as well as its derivative, are continuous, unless points are
multiply defined.

All spline control points are registered as parameters in the tape, as soon as the block edges 
are created from the control points. The mesh utility afterwards evaluates the spline at the 
required intervals and constructs the mesh primitives (points, faces, cells, etc.). After the mesher 
has finished, it returns a polyMesh object which holds the mesh information. Instead of reading 
the mesh information from the mesh files as usual, the finite volume discretization is instead 
constructed from this polyMesh object, leaving the derivative information between mesher and 
solver intact. 

The mesh generation with blockMesh is inherently limited to serial execution. To retain
the parallelism of the adjoint flow solver, the points generated by the serial mesher need to
be distributed to the parallel nodes. The dependencies of the points on the parameters can
be retained by allowing AMPI to capture the communication. To facilitate this, we load the
distributed finite volume mesh from an existing domain decomposition, but update the processor
local point fields from the global field generated by the mesher (which triggers a recalculation of the
primitive mesh). The global mesh point vector is thus split into several local parts and distributed
by (A)MPI to the corresponding processors. This introduces a communication overhead, as well
as the need to hold the whole point mesh on one processor (also required by the mesher anyway).

Figure 4.42: Distribution of the global point fields Q, created by the mesher M, to local point
fields, required for parallel solver execution. Adjoints are seeded into y on processor Pp and
then propagated back to the parameters.

The derivatives $dJ/dQ$ of the serial or parallel solver S can be calculated by any means suitable
to adjoin the iteration history. This step is essentially identical to the calculation of the full
shape adjoints. For the proof of concept, we choose the piggyback approach, repeatedly adjoining
steps of the SIMPLE algorithm until the adjoint of the state x is sufficiently small. The product
with $dM/d\gamma$ can be calculated with a single interpretation of the remaining tape of the mesher.


4.6.3 Application to Pitz-Daily Geometry 


As an application, we again use the Pitz-Daily backward facing step case, used for topology
optimization in previous sections. The step is replaced by a ramp, such that the width of the
nozzle gradually expands. Depending on the steepness of the ramp, the flow might still detach
or remain attached to the lower wall. Except for the change of mesh topology, the boundary
conditions are kept identical to the case previously discussed. As a first step, to find a sensible
initial shape, the angle of the ramp is varied, while keeping the walls straight. As can be seen
in Figure 4.43, the optimal region is quite broad. Depending on the starting position a local
optimization will run into different local minima, located at different angles but with only slightly
differing values of the cost function. The parameter $\gamma_0 = S_{0x}$ is chosen from the middle of
the optimal region as 108°. The global optimum is located in this region, both according to a brute
force evaluation of the cost functions for different angles, evaluated at 65 different positions
of $S_{0x}$ between 50 mm and 180 mm, and according to a gradient based optimization starting
at $S_{0x}$ = 50 mm, employing random perturbations after convergence to escape local minima (basin
hopping [WD97]). As the variation of $S_{0x}$ changes the length ratio between the different blocks,
the cells in x-direction are dynamically redistributed between blocks B, and Bo (see Figure 4.44), 
retaining mesh quality while keeping the global cell count constant. 
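The basin hopping idea used for the ramp position can be sketched in a few lines. The example below is a simplified illustration and not the setup used for the flow case: `toyCost` is a hypothetical 1D function with several local minima, the local optimizer is a plain gradient descent, and a new minimum is accepted only if it improves the cost (a greedy acceptance rule, chosen here for brevity instead of a probabilistic one).

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Toy 1D cost function with several local minima (a stand-in, not the
// power loss of the flow case) and its derivative.
double toyCost(double x) { return 0.05 * x * x + std::sin(3.0 * x); }
double toyGrad(double x) { return 0.1 * x + 3.0 * std::cos(3.0 * x); }

// Local optimization: plain gradient descent with a fixed step size.
double localMinimize(double x, int steps = 2000, double lr = 0.01) {
    for (int i = 0; i < steps; ++i) x -= lr * toyGrad(x);
    return x;
}

// Basin hopping: perturb the current local minimum randomly, re-optimize
// locally, and keep the new minimum only if it improves the cost.
double basinHopping(double x0, int hops, double stepSize, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> perturb(-stepSize, stepSize);
    double best = localMinimize(x0);
    for (int i = 0; i < hops; ++i) {
        const double candidate = localMinimize(best + perturb(rng));
        if (toyCost(candidate) < toyCost(best)) best = candidate;
    }
    return best;
}
```

By construction the result is never worse than the local minimum reached from the starting point alone; how often a better basin is found depends on the perturbation size relative to the distance between the minima.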

Next, the shape of the wall is modeled by a B-spline with three interior control points. In 
the initial configuration the control points are placed, such that the wall is tangential to the 
connecting walls. The control points are allowed to move in x and y-direction, giving the optimizer 
seven DOF for the spline control points. The gradient is assembled from the derivatives of the cost
function, here again power loss, with respect to the positions of the control points, as well as the 








x-coordinate of the ramp endpoint, to allow corrections of the ramp slope if required. The vector 
of parameters is thus 7y = So, Bl) Ply) 92n) Oly) O8e> S3, | € R’, with the corresponding gradient 
of same dimension V4.7 € R’. The control points of the splines are bounded by [ymin, 0]? to avoid 
mesh breaking overshoots of the solution (which except for badly behaved FD approximations of 
the gradient was not strictly needed). 

The optimization was carried out using the SLSQP solver [Kra94], implemented in the
SciPy [JOP+01] optimization framework, as it both implements parameter bounds and linear
constraints and proved more reliable than other solvers. For the unconstrained (but bounded)
case, the SLSQP method resembles the (Pseudo-) Newton method. Using a gradient obtained
by adjoint mode and a Hessian internally approximated by SciPy, the optimization required 77
passive and 26 active evaluations of the flow field to converge to the prescribed tolerance bounds.

Figure 4.43: Influence of the ramp angle on the power loss. The optimum is located in a broad
shallow valley. To find the global minimum, depending on the starting position, a global
optimization might be needed.

Figure 4.44: Parametrized backward facing step model. Ramp is parametrized by B-spline with
three control points and seven DOF. Design baseline in blue, optimized geometry in red. For
better visibility, the y-axis is scaled by a factor of two.

Figure 4.45: Flow inside the optimized Pitz-Daily geometry.


The initial and final geometry are shown in Figure 4.44. Changing from the initial design to
the final geometry alters the shape only slightly, but still reduces the power loss by about
6.5%. The flow through the optimized geometry is shown in Figure 4.45. The solution closely
resembles solutions obtained by topology optimization.

While the optimization produces reasonable results, it tends to converge into local minima.
This is evident from the fact that the optimization does not deviate much from the starting values
of the splines in x-direction, regardless of how they are chosen. It is thus advisable to try several
initial configurations or to employ a global optimization strategy, e.g. basin hopping.


In this chapter two different case studies are presented, highlighting the different approaches
to optimization, and the flexibility the discrete adjoint framework allows. The former case is a
topology optimization of a finely resolved 3D geometry, showcasing the parallel scalability of the
adjoint solvers. The latter case is a shape sensitivity analysis of a 2D airfoil at a high Reynolds
number, showcasing the influence of turbulence and the stability of the discrete adjoint approach.
The chapter is closed with a brief overview of other applications considered with the discrete
adjoint OpenFOAM framework.


5.1 Topology Optimization of 3D Pitz-Daily Case 


In this section a three dimensional version of the Pitz-Daily case is topology optimized. Particular 
emphasis is placed on the scaling behavior of the computation on multiple MPI nodes. 


5.1.1 Case Configuration 


The test case is derived from the 2D Pitz-Daily case, introduced in Section 3.1.2, by extruding it
by 0.25 m in the z-direction. This case is also used by [AU16] to assess the scaling of the standard
OpenFOAM version on the Hazel Hen¹ cluster of HLRS Stuttgart. In its coarsest configuration,
the test case is meshed with a 122 250 cell blockMesh (the 12 225 cells of the 2D case multiplied
by 10 cells in the z-direction). It can then be refined by uniformly increasing the number of
cells along each dimension of the geometry. Due to the nature of the blocks for this particular
test case, the refinement has to be carried out with an integer factor. The number of cells for
refinement levels 1 to 5 is listed in Table 5.1.





Table 5.1: Number of cells for different refinement levels of the Pitz-Daily 3D test case 


Refinement Level Cells 
1 122 250 
2 978 000 
3 3 300 750 
4 7 824 000 
5 15 281 250 
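The cell counts in Table 5.1 follow from refining the base mesh by the same integer factor in all three directions, which multiplies the number of cells by the cube of the refinement level. A minimal sketch of this arithmetic, with `cellsAtLevel` as an illustrative helper name:

```cpp
#include <cassert>

// Cells of the 3D Pitz-Daily case at an integer refinement level: the base
// mesh is refined by the same factor in all three coordinate directions.
long cellsAtLevel(long level) {
    const long baseCells = 122250;  // 12 225 cells (2D) x 10 cells in z
    return baseCells * level * level * level;
}
```

For level 3, for example, this gives 122 250 · 27 = 3 300 750 cells.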


The Reynolds number is set to Re = 25000 and the k-ε turbulence model is used to obtain a
non-transient solution. The cost function is implemented as the (total) power loss between inlet
and outlet, as defined in Section 3.1.3.


¹ https://www.hlrs.de/de/systems/cray-xc40-hazel-hen/

Figure 5.1: Geometry of the Pitz-Daily 3D test case with blocks (bold black lines) and cells
(gray lines) for refinement level one. Inflow on the upper left, outflow on the right. The
geometry is extruded 0.25 m in z-direction.


5.1.2 First Simulations 





To assess the convergence speed of the primal and derivatives, we first run a tangent simulation
on the base configuration. The initial flow field is set to a solution of the potential flow equation
(potentialFoam). We then run a tangent solver for 400 iterations. By seeding the tangents of all
parameters $\alpha_i$ to one, we obtain the sum of all partial derivatives

$$
\dot{J} = \sum_i \frac{\partial J}{\partial \alpha_i}
$$

with one evaluation of the tangent augmented simulation code. The convergence history of the
sensitivity $\dot{J}$ is depicted in Figure 5.2. We see no major changes in both the primals and the
sum of tangents after 1500 iterations. The sensitivities do not lag behind the primals significantly.
However, the momentum equation lags behind the pressure correction by a few hundred iterations,
hinting that the momentum equations might be too strongly relaxed.
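Seeding all tangents with one can be illustrated with a minimal forward-mode data type. The sketch below is not dco/c++; `Dual` and `toyCost` are hypothetical stand-ins that only support the two operations needed here. A single evaluation with all tangents set to one returns the sum of all partial derivatives of the cost function.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal forward-mode (tangent) scalar: value and directional derivative.
struct Dual {
    double v;  // primal value
    double d;  // tangent
};
Dual operator+(Dual a, Dual b) { return {a.v + b.v, a.d + b.d}; }
Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.d * b.v + a.v * b.d}; }

// Toy cost function J(alpha) = sum_i (i + 1) * alpha_i^2 (hypothetical).
Dual toyCost(const std::vector<Dual>& alpha) {
    Dual J{0.0, 0.0};
    for (std::size_t i = 0; i < alpha.size(); ++i)
        J = J + Dual{static_cast<double>(i + 1), 0.0} * alpha[i] * alpha[i];
    return J;
}

// Seed every tangent with one: a single evaluation returns the sum of all
// partial derivatives dJ/dalpha_i.
double sumOfPartials(const std::vector<double>& alpha) {
    std::vector<Dual> a;
    for (double x : alpha) a.push_back({x, 1.0});
    return toyCost(a).d;
}
```

For alpha = (1, 2, 3) the partial derivatives are 2, 8 and 18, so the call returns their sum. The individual derivatives cannot be recovered from this one number, which is why the quantity is used here only as a cheap convergence indicator.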

After having determined the approximate convergence rate with the tangent simulation, we
now switch to the adjoint mode. The flow is initialized to a mostly converged state, obtained by
2500 passive iterations of the simpleFoam solver. We run the adjoint simulation with a uniform
starting field $\alpha$, initialized to zero. First we run for 400 iterations, to check if the adjoints
match the values predicted by the tangent simulations. Figure 5.2 shows that the sensitivities
obtained by piggybacking match the value obtained by tangent mode. Further, the sensitivity
obtained with reverse accumulation by repeatedly adjoining the 400th iteration step matches the
results obtained by piggybacking and tangent mode.


5.1.3 Run Time and Optimization Results 


For the optimization, we utilize the piggyback algorithm (introduced in Section 4.2), which is run 
until both the primals and adjoints are sufficiently converged. 

Table 5.2 lists the run time factors and memory usage of the piggyback simulation. Shown is 
the average run time of a single iteration step of the simpleFoam and piggySimpleFoam solvers. 
By introducing the dco/c++ data type into simpleFoam, without executing an augmentation of 
the forward section, the run time increases by a factor of over two and the memory consumption 
by a factor of 1.6.

Figure 5.2: Sensitivities obtained by tangent and adjoint modes, with piggybacking from the
initial state, and reverse accumulation by repeatedly adjoining time step 400.


Table 5.2: Global run time and memory, including factors, for the Pitz-Daily 3D case.

Solver                       Run time (s)   Factor   Memory (MB)   Factor
simpleFoam passive                  26.92     1.00        239.14     1.00
simpleFoam active                   79.08     2.79        381.93     1.60
piggySimpleFoam w. SDLS            218.64     8.12       9979.15    25.00
piggySimpleFoam w.o. SDLS          463.48    17.22      38474.52   160.89

Table 5.3: Run time and run time factors (compared to simpleFoam) of the individual solver
components. The run times and run time factors for the individual linear equation system solvers
are given for the augmented forward and reverse interpretation (only for SDLS) execution.

                   simpleFoam    simpleFoam active       Piggyback w. SDLS       Piggyback w.o. SDLS
                   Run time (s)  Run time (s)   Factor   Run time (s)   Factor   Run time (s)   Factor
Total                     26.91         79.81     2.82         213.00     7.92         463.48    17.22
Augm. forward             25.80         73.80     2.86         134.69     5.22         296.26    11.48
U                          6.39         15.43     2.41          16.78     2.62          62.69     9.80
p                          7.74         19.36     2.50          20.98     2.71         108.65    14.04
k                          2.14          5.10     2.38           5.14     2.40          21.18     9.88
epsilon                    1.68          4.01     2.38           4.65     2.76          16.09     9.57
Seed                          -             -        -           3.47        -          15.42        -
Interpretation                -             -        -          66.39        -         143.32        -
U reverse                     -             -        -          16.86        -              -        -
p reverse                     -             -        -          14.96        -              -        -
k reverse                     -             -        -           9.01        -              -        -
epsilon reverse               -             -        -           4.52        -              -        -


Enabling the augmented forward and reverse interpretation for the piggySimpleFoam solver
increases the run time factor to approximately eight and the memory factor to 25.
This is still considerably better than with black-box differentiated linear solvers, which consume
double the run time and over six times more memory.


The run times are broken down into the individual solver phases in Table 5.3. The solver run is
divided into the forward phase, seeding phase and interpretation phase. The forward phase
is further broken down into the individual linear solver calls (velocity, pressure and turbulence
(k-ε)). The seeding phase is dominated by the first allocation and initialization of the adjoint
vector, and therefore profits from SDLS, due to the reduced size of the adjoint vector. For the
SDLS case, the interpretation phase is also broken down into the individual linear solver calls,
executed from the adjoint callback objects. As with the primal, the solution time of the adjoint
equation systems is dominated by the velocity and pressure systems; the turbulence equations are
comparatively cheap to solve.


When both the primal and adjoint have converged, we update the parameters α according
to the steepest descent algorithm. The values of α are clamped from below at zero to eliminate
non-physical momentum sources and capped at a maximum value to obtain a solution which is
bounded from above.
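
A minimal sketch of such a clamped steepest-descent update is given below. The function name `update_alpha`, the step size and the bound `alpha_max` are hypothetical and chosen for illustration; they are not taken from the piggySimpleFoam implementation.

```python
def update_alpha(alpha, gradient, step, alpha_max):
    """One steepest-descent step, then clamp each entry to [0, alpha_max]."""
    return [min(max(a - step * g, 0.0), alpha_max)
            for a, g in zip(alpha, gradient)]


# three cells: one at the lower bound, one in the interior, one near the cap
alpha = [0.0, 5.0, 99.0]
gradient = [1.0, -2.0, -50.0]   # hypothetical sensitivities dJ/dalpha per cell
new_alpha = update_alpha(alpha, gradient, step=0.1, alpha_max=100.0)
```

In this example the first entry is held at zero, the second moves freely, and the third is cut off at the upper bound.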





Figure 5.5 shows the convergence history of the cost function J(α) over 2000 steepest-descent
iterations. The average total pressure drop between inlet and outlet drops from 18.74 m²s⁻² in
the baseline version to 11.49 m²s⁻² in the optimized version.
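
As a quick consistency check, and assuming that the two reported pressure drops are the initial and final values of the cost function, the relative improvement follows as

```latex
\[
  \frac{18.74 - 11.49}{18.74} = \frac{7.25}{18.74} \approx 0.387 ,
\]
```

which agrees with the improvement of roughly 38% quoted in the caption of Figure 5.5.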


The top and bottom part of Figure 5.3 show the velocity field and total pressure contours for 
the baseline and optimized design respectively, while Figure 5.4 shows the distribution of the 
penalization parameter α for the optimized case. The gap in the penalization field in the area
after the step is an artifact visible at different mesh refinement levels; presumably it shifts
some flow from the centerline of the duct towards the near and far field of the geometry.

Figure 5.3: Velocity plot on the z-midplane of the initial (top) and optimized (bottom)
configuration. White contour lines show levels of total pressure.





Figure 5.4: Geometry with penalized regions where α > 0.

Figure 5.5: Convergence of the cost function J for the optimization of the 3D Pitz-Daily test
case, improving the predicted power loss by 38%.

Figure 5.6: Mesh decomposition for 48 and 192 processors using scotch decomposition. Processor
boundaries are shown in black, remaining mesh colored by processor id.


Table 5.4: Relevant Hardware of the RWTH Aachen Compute Cluster.

Section                   CLX-MPI               CLX-SMP
LSF Node Type             c24m128               c144m1024
# Nodes                   600                   6
# Sockets per Node        2                     8
CPU Codename              Intel Broadwell EP    Intel Broadwell EX
CPU Model                 E5-2650v4             E7-8860v4
Clock Speed (GHz)         2.2                   2.2
# Cores per Chip          12                    18
# Cores per Node          24                    144
Memory per Node (GB)      128                   1024
Memory per Core (GB)      5.33                  7.11


5.1.4 Scaling Behavior on HPC Cluster 


Having established the feasibility of this test case for topology optimization, we now use this
case for benchmarking the scaling behavior of the implementation. The case is decomposed onto
multiple processors using the ptScotch decomposition algorithm [CP08]. The decomposition for
48 and 192 processors is depicted in Figure 5.6.

In Figure 5.8 we show the memory consumption and run time results for refinement level three.
We again observe a major reduction of tape memory when utilizing the symbolically differentiated
linear solvers (8088 GB down to 253 GB). For poorly conditioned problems, which require more
solver iterations, the memory improvements are even higher, as the memory consumption without
symbolically differentiated solvers depends directly on the number of solver iterations. In addition,
a single rogue iteration that needs more linear iterations than average to complete may abort the
whole simulation due to insufficient RAM.

With SDLS we also generally see a reduction in run time, due to improved linear solver efficiency
(calculation in passive mode), less memory allocation, less adjoint propagation, and less adjoint
communication, which outweighs the need to solve additional linear equation systems during
the adjoint propagation. In our case, the adjoint propagation phase is slightly slower due to
the additional equation systems which need to be solved. This is offset by the more efficient
augmented primal section, leading to an overall reduction in run time of roughly 40%.
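
The additional linear systems mentioned above stem from the well-known adjoint of a linear solve: for x = A⁻¹b, the adjoints are obtained from one solve with the transposed matrix, b̄ = A⁻ᵀx̄ and Ā = -b̄xᵀ, without taping the solver iterations. The following sketch checks this rule on a 2x2 system; the helper functions and numbers are illustrative assumptions and do not reflect the SDLS implementation in dco/c++ or OpenFOAM.

```python
def solve2(A, b):
    """Solve a 2x2 system A x = b with Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def transpose(A):
    """Transpose of a 2x2 matrix."""
    return [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]


def adjoint_solve(A, x, x_bar):
    """Symbolic adjoint of x = A^{-1} b: one extra solve with A^T."""
    b_bar = solve2(transpose(A), x_bar)
    A_bar = [[-b_bar[i] * x[j] for j in range(2)] for i in range(2)]
    return A_bar, b_bar


A = [[4.0, 1.0], [2.0, 3.0]]
b = [1.0, 2.0]
x = solve2(A, b)
x_bar = [1.0, 1.0]                      # adjoint seed for J = x[0] + x[1]
A_bar, b_bar = adjoint_solve(A, x, x_bar)

# central finite differences for dJ/db[0] and dJ/dA[0][1] as a reference
h = 1e-6
fd_b0 = (sum(solve2(A, [b[0] + h, b[1]]))
         - sum(solve2(A, [b[0] - h, b[1]]))) / (2 * h)
A_plus = [[4.0, 1.0 + h], [2.0, 3.0]]
A_minus = [[4.0, 1.0 - h], [2.0, 3.0]]
fd_A01 = (sum(solve2(A_plus, b)) - sum(solve2(A_minus, b))) / (2 * h)
```

The memory needed for this adjoint is independent of the number of iterations the primal solver takes, which is consistent with the tape savings reported above.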

Both the run time and the memory consumption are dominated by the solution of the pressure 
equation, for which the geometric-algebraic multi-grid solver (GAMG) is used. The remaining 
equations are solved with a Gauss-Seidel solver variant, which is often used in near transient 
cases due to its stability. 

In Figure 5.7 we show the scaling behavior of our implementation on the RWTH University 
compute cluster. Benchmarked are refinement levels three and four of the test case. Table 5.4 lists 
the two system types most relevant to the application of AD. They both provide a relatively high 
amount of RAM per core with 5.33 GB and 7.11 GB respectively. With currently 600 installed 
nodes, the former system is the most common node of the RWTH cluster. It supplies 24 cores per 
node and 128 GB of RAM, making it a good general purpose choice. The latter system provides 
1024 GB of RAM per node, making it the preferred choice if a very high amount of RAM is 
needed locally. However, many of the 144 cores need to remain idle to utilize the full memory for 
single threads, which is punished by the job queueing system. 

The simulation was run on the former machines supplying 128 GB each. As the finer simulation 
(level 4) consumes about 650GB RAM (see Figure 5.8), for cases decomposed onto few processors 
(n = {12,24,48,96}), we cannot use all physical cores of the nodes, as not enough RAM is 
available locally. For those cases, we only place 3, 6 and 12 threads respectively per node, leaving 
the remaining cores unutilized. The total available memory bandwidth is thus shared by fewer 
threads for these cases, at the expense of more communication between distant nodes (connected 
by InfiniBand). 

For n = {192,384,768}, we can fully saturate the nodes with 24 threads each. For the 
benchmark, we time 20 piggyback iterations and calculate the average time needed for both the 
augmented primal and adjoint propagation phases. In the figure we see scaling of both phases. 




















For reference, the average run time for passive calculation with simpleFoam is also shown. For 
the most part, the scaling behavior of the discrete adjoint and passive version are comparable. 
The passive calculation stops scaling earlier than the discrete adjoint, because fewer operations 
need to be performed during each iteration for the passive solver. The run time of the passive 
calculation is thus dominated by the communication overhead earlier. 

For the coarser case of refinement level three, scaling stops after 192 threads for the passive 
computation and after 384 threads for the discrete adjoint. At this point no further scaling is to 
be expected, as the number of cells per thread has already fallen below 10000. For the finer case, 
scaling stops for the passive computation after 192 threads, but continues onto 768 threads for 
the discrete adjoint. At this stage each individual thread only holds around 12000 cells. We thus 
expect the scaling of the discrete adjoint to stop beyond that point for this case as well, as the 
numeric work load for each process becomes too small. 

Scaling of (primal) OpenFOAM has been shown to extend into thousands of processors [Dur+15]. 
For even higher processor numbers, some design decisions limit the scalability [Cul11; AU16]. 

Figure 5.7: Run time scaling of the discrete adjoint on the RWTH cluster. The average run
time out of twenty piggyback steps is shown for the recording and interpretation phases. For
reference, the scaling behavior of the passive simpleFoam solver and the theoretical ideal
scaling are also shown.

Figure 5.8: Memory consumption (over all processes) and run time of the Pitz-Daily example
for 768 processors with (top) and without (bottom) symbolically differentiated linear solvers.


5.2 Shape Sensitivities of NACA Airfoil 





In this section we apply the procedures introduced in Section 4.4 to generate the surface sensitivities
of a NACA airfoil. The flow is calculated using the Spalart-Allmaras turbulence model, and the
sensitivities of the airfoil are evaluated w.r.t. viscous lift and drag.


5.2.1 Modelling of NACA Airfoils 


The concept of NACA airfoils was introduced by the National Advisory Committee for Aeronautics
(NACA) in the early 20th century to describe different airfoil shapes efficiently. The most
commonly used parametrization is the 4-digit parametrization, e.g. NACA 4412, which parametrizes
an asymmetric 2D airfoil. A special case is the generation of symmetric airfoil profiles, where
the distance from the camber line is equal for both the upper and lower surface. Those profiles
are described by only one parameter t, encoded with two digits, e.g. NACA 0012. Due to the
symmetric pressure profile, symmetric profiles produce no lift at zero angle of attack (that is, the
angle between the chord line of the airfoil and the direction of the freestream flow).
For a symmetric airfoil, the distance from the camber line is defined as

\[ y_t(x) = 5t \left( 0.2969\sqrt{x} - 0.1260\,x - 0.3516\,x^2 + 0.2843\,x^3 - 0.1036\,x^4 \right), \]

giving the coordinates of the upper and lower surfaces as x_u = x, x_l = x, y_l = -y_t(x)
and y_u = +y_t(x).
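
The thickness distribution can be evaluated with a few lines of code. The sketch below assumes that x is normalized by the chord length (x in [0, 1]) and that t is the maximum thickness as a fraction of the chord (t = 0.12 for NACA 0012); the function names are hypothetical.

```python
import math


def naca_half_thickness(x, t):
    """Distance y_t from the camber line at chord position x in [0, 1]."""
    return 5.0 * t * (0.2969 * math.sqrt(x) - 0.1260 * x - 0.3516 * x**2
                      + 0.2843 * x**3 - 0.1036 * x**4)


def naca_symmetric(t, n=101):
    """Upper and lower surface points of a symmetric 4-digit profile."""
    xs = [i / (n - 1) for i in range(n)]
    upper = [(x, naca_half_thickness(x, t)) for x in xs]
    lower = [(x, -naca_half_thickness(x, t)) for x in xs]
    return upper, lower


upper, lower = naca_symmetric(t=0.12)   # NACA 0012
# largest distance between upper and lower surface over all sample points
max_thickness = max(yu - yl for (_, yu), (_, yl) in zip(upper, lower))
```

With the coefficient 0.1036 for the quartic term, the thickness vanishes at x = 1, i.e. the trailing edge is closed.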


Figure 5.9: Cross section of asymmetric NACA 4412 airfoil (left) and symmetric NACA 0012 (right)
with zero angle of attack. Shown are the camber line (dashed), the upper surface and the lower surface.

Figure 5.10: Illustration of the O-type meshing of the airfoil. Zoom on the boundary layers on
the right. For the actual mesh, the radius of the bounding circle is considerably bigger and
the boundary layers are finer.


The wing cross section and the camber line for the asymmetric NACA 4412 and symmetric 
NACA 0012 wings are shown in Figure 5.9. In the following, we will further investigate the 
NACA 0012 airfoil. A symmetric airfoil is useful to verify the adjoint implementation, as one 
expects a symmetric sensitivity field at zero angle of attack. 

An O-type mesh [TWMB85] is generated for a circular domain around the airfoil, with a 
radius of thirty times the wing length. The spacing of the boundary layers yields a y+ value of 
approximately 0.1 on average and a maximum y+ value of 0.25 at the chosen Reynolds number 
of Re = 2- 10°. The angle of attack is varied by changing the direction of the incoming flow. An 
illustration of the mesh layout is shown in Figure 5.10. In order to make the mesh features clearly 
visible, the radius of the circular domain is lowered and the mesh resolution is reduced in the 
figure. The resulting mesh consists of approximately 80000 cells, most of which are needed to 
form suitable boundary layers and to refine the mesh regions in the wake after the trailing edge. 

The lift and drag of the wing are considered as cost functions. The lift force on a wing is defined
as the component of the aerodynamic force exerted by the fluid onto the wing that is perpendicular
to the freestream flow. Analogously, the drag force on a wing is defined as the component of the
aerodynamic force parallel to the freestream flow.











For low velocities, the aerodynamic force is dominated by the pressure forces, which act normal
to the skin surface. The total pressure force acting on the wing can be calculated by integrating
along the airfoil surface Γ:

\[ \mathbf{F}_p = \oint_{\Gamma} p(w)\, \mathbf{n}\, \mathrm{d}w . \]
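
On a discretized surface the integral becomes a sum over the surface edges. The following 2D sketch uses a closed polygon with one pressure value per edge; the sign convention (normal pointing from the fluid into the body, so that the force on the body is +p n) and the function name are assumptions made for this illustration.

```python
def pressure_force(points, pressures):
    """Approximate the closed surface integral of p n over a 2D polygon.

    points: vertices in counter-clockwise order; pressures: one value per edge.
    Assumption: n points from the fluid into the body.
    """
    fx = fy = 0.0
    count = len(points)
    for i in range(count):
        (x0, y0), (x1, y1) = points[i], points[(i + 1) % count]
        dx, dy = x1 - x0, y1 - y0
        # (-dy, dx) is the inward normal times the edge length
        fx += pressures[i] * (-dy)
        fy += pressures[i] * dx
    return fx, fy


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
uniform = pressure_force(square, [2.0, 2.0, 2.0, 2.0])
bottom_only = pressure_force(square, [1.0, 0.0, 0.0, 0.0])
```

A uniform pressure on a closed surface yields no net force, whereas pressure acting only on the bottom edge pushes the body upwards.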


For higher velocities, the contribution of viscous effects, that is forces caused by the shear stresses
between the fluid layers, to the aerodynamic forces becomes non-negligible. This part of the forces
is called friction or shear forces F_s, which act tangentially to the skin surface:

\[ \mathbf{F}_s = \oint_{\Gamma} s(w)\, \mathbf{t}\, \mathrm{d}w , \]


where s is the friction force at each location on the airfoil. The formula for the
friction force depends on the chosen turbulence model, but can in general be obtained from the
implementation of the viscous stress tensor. The total (viscous) aerodynamic force F_v = F_p + F_s
is then the sum of pressure and friction force.

Figure 5.11: Airfoil with α = 15 deg angle of attack. For the left airfoil, the freestream is aligned
with the x-axis; for the right airfoil, the camber line is aligned with the x-axis. As a result, the
lift and drag vectors are tilted relative to the coordinate system for the right airfoil.

For an angle of attack α, the lift induced by the aerodynamic forces is defined as the
force projected in the direction perpendicular to the freestream facing upwards:

\[ L_v = \mathbf{F}_v \cdot \begin{pmatrix} -\sin(\alpha) \\ \cos(\alpha) \\ 0 \end{pmatrix} , \]

and the drag as the force parallel to the freestream:

\[ D_v = \mathbf{F}_v \cdot \begin{pmatrix} \cos(\alpha) \\ \sin(\alpha) \\ 0 \end{pmatrix} . \]

A graphical representation of the drag and lift vectors is given in Figure 5.11. 
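
The two projections can be written down directly. The sketch below assumes that the freestream lies in the x-y plane and is rotated by the angle of attack against the x-axis; the function name and the force vector are hypothetical.

```python
import math


def lift_and_drag(force, alpha_deg):
    """Project a force (fx, fy, fz) onto the lift and drag directions."""
    a = math.radians(alpha_deg)
    lift_dir = (-math.sin(a), math.cos(a), 0.0)   # perpendicular to freestream
    drag_dir = (math.cos(a), math.sin(a), 0.0)    # parallel to freestream
    lift = sum(f * d for f, d in zip(force, lift_dir))
    drag = sum(f * d for f, d in zip(force, drag_dir))
    return lift, drag


force = (1.0, 2.0, 0.0)                  # hypothetical aerodynamic force
lift0, drag0 = lift_and_drag(force, 0.0)
lift15, drag15 = lift_and_drag(force, 15.0)
```

Since the two directions are orthonormal, the in-plane magnitude of the force is preserved for every angle of attack; only its split into lift and drag changes.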


5.2.2 Primal and Sensitivity Results 


The pressure distribution along the airfoil surface for three different angles of attack is shown in
Figure 5.12. For zero angle of attack, the pressure curves of the upper and lower surface coincide,
so the area enclosed between them vanishes and no lift is produced by the airfoil in this configuration.

The sensitivities have been found to depend strongly on a finely resolved boundary layer (y+ < 1),
even more so than the primal pressure distribution. In addition, the sensitivity fields for drag differ
fundamentally between the viscous and non-viscous formulation of the cost function. This is to be
expected, because at the chosen Reynolds number the viscous drag contributes a considerable
amount to the total drag. The lift is less influenced by viscous effects.

To obtain a simulation state which converges with reverse accumulation, i.e. a state where
the adjoint iteration is contractive, the relaxation factors for the primal equations had to be
slightly lowered. While the primal converges smoothly for under-relaxation factors of 0.9 for the
momentum equations (using the consistent SIMPLEC scheme) and 0.7 for the Spalart-Allmaras
turbulence equations, convergence of the adjoint could only be obtained after lowering the
under-relaxation factors for the primal equations to 0.8 and 0.6 respectively. That is, the under-relaxed
field (here for the pressure) $p_u^{i+1}$ is calculated with under-relaxation factor λ as

\[ p_u^{i+1} = \lambda\, p^{i+1} + (1 - \lambda)\, p^{i} . \]
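
Under-relaxation is a plain blend of the newly computed field and the field of the previous iteration. The following sketch applies it to a list of cell values; the function name and the numbers are hypothetical.

```python
def under_relax(new, old, lam):
    """Blend the new field with the old one using relaxation factor lam."""
    return [lam * n + (1.0 - lam) * o for n, o in zip(new, old)]


p_old = [1.0, 2.0]          # field of the previous iteration
p_new = [2.0, 4.0]          # field computed in the current iteration
p_relaxed = under_relax(p_new, p_old, lam=0.8)
```

A factor of one accepts the new field unchanged, while smaller factors damp the update and thereby reduce the contraction rate of the fixed-point iteration.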





The sensitivity results for the drag w.r.t. movement of the surface nodes in surface normal
direction are shown in Figure 5.14. Negative sensitivities indicate the desire to move the surface
nodes against the airfoil surface normal direction, reducing the cross section of the airfoil. As
expected, the sensitivities for zero angle of attack are symmetric about the centerline of the airfoil.
For two and four degree angle of attack, the sensitivities are asymmetric, with the lower surface
of the airfoil exhibiting higher sensitivity values. Overall, the sensitivities are visibly correlated
with the pressure distribution. The sensitivities are smooth along the airfoil surface, except for a
singularity limited to the trailing edge of the geometry.

Figure 5.12: Pressure distribution along the surface of the NACA 0012 wing for zero, two and
four degree angle of attack. The pressure on the lower surface is drawn solid, the pressure on the
upper surface dashed. The area enclosed by the lower and upper pressure curve corresponds to the
total lift of the wing (excluding the viscous forces).

With the Spalart-Allmaras turbulence model, an issue with the calculation of the wall distance
y*, needed for the calculation of the turbulence equations, was identified. If this calculation
is left unchanged, it introduces considerable noise into the adjoints, especially near the leading
edge. If the calculation of the wall distance is treated with a frozen adjoint assumption, the
adjoints become very smooth, while still retaining the same overall shape. Both the uncorrected
and corrected sensitivities for a coarse mesh with two degree angle of attack can be seen in
Figure 5.13. The fix almost completely removes the observed instabilities, leaving only a
singularity at the trailing edge. To make sure the correction does
not completely remove the adjoint sensitivities of the turbulence model from the calculation, the
same configuration is run with a frozen turbulence assumption. The results of this calculation are
also shown in Figure 5.13. This reveals that the sensitivities of the airfoil obtained with frozen
turbulence, while still indicating that the cross section of the airfoil needs to be decreased to
improve drag, exhibit a completely different behavior. In particular, the frozen adjoint solution
(even with the previous fix to the wall distance calculation applied) exhibits issues at the leading
edge, which do not appear with the fully differentiated Spalart-Allmaras turbulence model. To
summarize, calculating the wall distance with a frozen adjoint calculation changes the overall
appearance of the adjoints only a little, while considerably improving their smoothness. In
contrast, introducing a completely frozen turbulence assumption changes the overall appearance
of the adjoint sensitivities and exhibits additional issues.

Figure 5.13: Sensitivity of the drag w.r.t. movement of surface nodes of the NACA 0012 airfoil in
surface normal direction. The noisy green curve is calculated without the modified adjoint
wall distance, the smooth orange curve with the modification.

The results for the lift on the same case are presented in Figure 5.15. For zero angle of attack,
the results are symmetrical around the x-axis, indicating that the top surface needs to be moved
outwards to increase lift, while the lower surface needs to be pushed inwards. This is consistent
with the shape of an asymmetric NACA wing. For increasing angle of attack, the sensitivity
distribution shifts in the positive direction. Again, the sensitivities exhibit issues at the trailing edge
and are smooth otherwise.

The sensitivities can be subsequently used as an input to a mesh morpher, e.g. the one 
introduced in Section 4.4, with the aim to obtain an improved geometry. 

Figure 5.14: Sensitivities of drag w.r.t. movement of surface nodes of the NACA 0012 airfoil in
surface normal direction.

Figure 5.15: Sensitivities of lift w.r.t. movement of surface nodes of the NACA 0012 airfoil in
surface normal direction.

5.3 Further Applications of Discrete Adjoint OpenFOAM 


Discrete adjoint OpenFOAM has been used in a variety of applications by other
researchers and by students writing their theses at STCE.

The tangent and adjoint model has been used to differentiate through the OpenCascade
CAD environment, including the whole OpenFOAM mesh generation procedure using the
snappyHexMesh mesher. The solution process is differentiated using the adjointSimpleFoam
solver. This allows optimization with parameters defined directly in the CAD environment, by
using the spline control points of the intermediate NURBS representation as parameters [Gez16].

A parametric optimization study has been performed, coupling the sensitivities obtained by 
shape adjoints on a wing shaped rudder, with the proprietary parametric optimization framework 
CAESES by Friendship Systems AG. Studied were the thickness and twist of the wing in a flow 
with low angle of attack [FT16]. 

Some of the concepts presented in this thesis have been applied to the Foam-extend project,
making it possible to run an even wider range of solvers, as well as advanced discretization
methods, such as explicitly coupled block solvers [STN]. Furthermore, discrete adjoint OpenFOAM
has been used to obtain transient adjoints in the context of the aboutFLOW project [EU16]. Higher
order approaches and the accumulation of full Hessians have been explored in [Pee16].

The discrete shape optimization approach, using the Spalart-Allmaras turbulence model,
was used to obtain sensitivities and improve the drag of the student competition solar car
Sonnenwagen [Mol18]. A shape sensitivity, with respect to the drag of the car in a wind tunnel
configuration, is shown in Figure 5.16, demonstrating the applicability to complex geometries.
Similarly, the sensitivities of a flow around the rear wing of a touring car have been studied
in [Pes16].

Discrete adjoint OpenFOAM is available as open-source under the GNU GPLv3*. 








Figure 5.16: Shape sensitivities w.r.t. drag on Sonnenwagen solar competition car
geometry [Mol18]. Red cells need to be moved outwards, blue cells inwards.


* stce.rwth-aachen.de/foam 

6 Summary & Outlook 


6.1 Summary 


Starting from a black-box application of AD, introduced as a proof of applicability, different 
strategies were explored to lower the runtime and memory impact of the adjoint mode of 
AD. For steady state cases, the fixed point properties of the solution algorithms have been 
extensively exploited, using reverse accumulation, the piggyback method, or the discrete adjoint 
residual approach. 

For the residual approach, different coloring techniques have been implemented to speed up the 
calculation of the Jacobians using either tangent or adjoint mode, exploiting the sparse nature of 
FVM matrices. For transient cases, or if the convergence of the steady primal case is not sufficient 
to reliably obtain adjoints using the aforementioned methods, the whole iteration history can 
be adjoined with low memory overhead, while still retaining acceptable run time behavior, by 
using binomial checkpointing. The differentiated versions of all solution methods, which rely on 
embedded linear solvers, profit immensely from the optimizations implemented with SDLS. 

Per time step, piggyback run time factors under ten can be achieved, when compared to a 
passive primal execution. The memory consumption of the adjoint also generally is increased by a 
factor of around ten. Using reverse accumulation, the run time factor between adjoint and primal 
is lower, as no augmented primal needs to be calculated for each reverse accumulation iteration. 

During the implementation of the adjoint framework, an emphasis was to retain the parallel 
scalability of the primal and adjoint code. In particular this involved the introduction of AMPI 
to the discrete adjoint CFD framework. In addition to providing improved run time behavior, the 
decomposition of cases onto multiple processing nodes spreads the memory demand onto multiple 
machines. This both improves the memory throughput and also removes the demand for machines 
with unusually high amounts of available RAM. Using these capabilities, cases on complex 
meshes with over 7 million cells could be calculated. If, even with all applied optimizations, 
the RAM demand remains higher than what is available in hardware, the dco/c++ tape can be 
offloaded onto secondary storage (preferably low latency, high throughput, such as SSD storage). 
Due to its random access nature, the adjoint vector is retained in RAM. In order to remove 
the bottleneck of the adjoint vector outgrowing the amount of available RAM, adjoint vector 
compression techniques have been discussed and implemented. The improved adjoint vector
techniques proved to be very effective when adjoining long iteration histories, without significantly
impacting the run time of the augmented primal and reverse propagation. Having the discrete
adjoint available for a whole simulation environment proves to be very valuable. While arguably
not as efficient as an approach tailored to a specific application, it provides much
more flexibility.





Another advantage of a tool based AD implementation is the availability of different models of 
differentiation on the same code base. A feature added to the primal is immediately available in 
adjoint, tangent, and higher order versions, by just changing some environment variables. 

The discrete adjoint framework has been applied to a variety of CFD cases, ranging from
ducted flows to external aerodynamics. The availability of fully differentiated turbulence models
makes the discrete adjoint method attractive for applications where the turbulent quantities are
believed to have a major impact on the objective. A variety of cost functions, mostly regarding
external aerodynamic flows, have been implemented. New and blended cost functions can be 
readily implemented, due to the flexibility of the discrete adjoint. Penalization approaches can be 
used to implement basic constraints. Using the on demand compilation features of OpenFOAM, 
this can potentially even be done on a case by case basis, implementing the cost function as part 
of the case configuration. In addition to the high dimensional sensitivities, produced by topology 
and shape optimization, a parametric optimization approach was implemented, demonstrating 
the full AD treatment of a meshing tool. 


6.2 Outlook 


The discrete adjoint approaches presented in this thesis can be applied to a variety of different 
applications, either within the OpenFOAM framework, or in the general CFD field. The 
application to large scale transient simulations remains challenging, but also has the potential 
to create results not obtainable by other methods. Furthermore, for transient applications, the 
continuous adjoint approach faces some of the same challenges as the discrete adjoint, closing the 
performance gap. With the advances in the file tape and parallel adjoints made in this thesis, the 
computation of sensitivities for very large and complex geometries becomes feasible. With the 
developed profiling methods, further avenues of optimization for the AD implementation should 
be identified and implemented. In addition to the already employed forward activity analysis, 
additional optimizations can be applied to the tape, improving performance, especially if the tape
is evaluated multiple times.

To expand the capabilities of the discrete adjoint framework, the application to multi-physical 
optimization is promising. The flexibility of the discrete adjoint makes it applicable to a wide 
variety of simulation approaches and code bases already in existence, without requiring major 
code rewrites. A planned development is the incorporation of the discrete adjoint into conjugate
heat transfer (CHT) problems.

Another advanced topic is the incorporation of robust optimization, to obtain designs which 
are feasible under a variety of operating conditions. The availability of higher order adjoints 
facilitates the usage of sophisticated robust optimization schemes. 

More complex constrained optimization methods should be explored, in order to generate 
solutions which can be produced cost effectively (design to manufacture), or which adhere to 
certain design constraints (e.g. fixed amount of volume). The already implemented parametric 
optimization capabilities could be used to directly couple the simulated geometries to their CAD 
representations, allowing for a more construction driven optimization approach. 

Besides optimization problems, other applications of the adjoint methods are conceivable, and 
already pursued by other researchers, using the discrete adjoint OpenFOAM framework. Examples
are the application of adjoint error estimators for adaptive mesh refinement and the usage of
adjoints for uncertainty quantification.

A.1 Location of Discrete Adjoint OpenFOAM Solvers 


Discrete adjoint OpenFOAM strives to integrate into the native OpenFOAM framework
as seamlessly as possible. The general folder layout remains unchanged from the native
OpenFOAM implementation. Let $FOAM_INST_DIR point to the OpenFOAM base directory;
then the solvers unique to discrete adjoint OpenFOAM are located in
$FOAM_INST_DIR/applications/discreteAdjointOpenFOAM. Adjoint solvers are located in
$FOAM_INST_DIR/applications/discreteAdjointOpenFOAM/adjoint, tangent solvers in
$FOAM_INST_DIR/applications/discreteAdjointOpenFOAM/tangent, passive solvers (e.g. for finite
differences) in $FOAM_INST_DIR/applications/discreteAdjointOpenFOAM/passive, and support
libraries (for checkpointing and the calculation of cost functions) in
$FOAM_INST_DIR/applications/discreteAdjointOpenFOAM/libs. The dco/c++ header files are
located in $FOAM_INST_DIR/src/OpenFOAM/dco.

The most relevant subdirectories for the usage and configuration of discrete adjoint OpenFOAM 
are listed below: 


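$FOAM_INST_DIR
├── applications
│   ├── discreteAdjointOpenFOAM
│   │   ├── adjoint
│   │   ├── experimental
│   │   ├── fd
│   │   ├── libs
│   │   │   ├── libCostfunction
│   │   │   └── libCheckpointing
│   │   ├── passive
│   │   ├── tangent
│   │   └── tools
│   └── solvers
├── etc
│   ├── bashrc
│   └── derivativesSettings.sh
├── src
│   └── OpenFOAM
│       └── dco
│           └── dco.hpp
├── tutorials
└── wmake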


Figure A.1: Folder structure of discrete adjoint OpenFOAM, with root at $FOAM_INST_DIR. 
Folders not immediately relevant are omitted for space reasons. 

A.2 Compile Options 


Discrete adjoint OpenFOAM uses the wmake build system of OpenFOAM. The environment of 
OpenFOAM is set by sourcing the etc/bashrc script file. Discrete adjoint OpenFOAM extends 
the environment variables of OpenFOAM (identifiable by the prefixes WM_* and FOAM_*) with a 
set of variables with the prefix DOF_*. 

The following variables are currently implemented: 





DOF_AD_OPTION: Choice of AD option to be used.
One of {A1S | T1S | T1V | T2A1S | T2T1S | Passive}, default: A1S.

DOF_COMPILER: Choice of compiler, gets passed along to WM_COMPILE_OPTION.
One of {Gcc | Clang | Icc}, default: Gcc.

DOF_COMPILE_OPTION: Specify level of optimization.
One of {Opt | Debug | Prof}, default: Opt.

DOF_BUILD_PROCS: Number of threads on localhost for parallel compilation, default: 8.

The flags for dco/c++ are set in the environment variable DOF_DCO_FLAGS. The DOF_DCO_FLAGS
are set by the bashrc script, depending on the choice of DOF_AD_OPTION. The flags are added to
the compilation flags and configure dco/c++ such that only the needed features are instantiated.
The choice of flags is detailed in the following subsections.





A.2.1 Passive Mode 


For passive mode, all non-essential features of dco/c++ are disabled. Neither adjoint nor tangent
data types are available. To enable the use of the same code base for all variants, some dco/c++
functions such as dco::passive_value() are still used. However, they should be optimized out
by the compiler.


export DOF_DCO_FLAGS="-DDCO_NO_DEFAULT" 


A.2.2 First order adjoint mode (A1S) 


For adjoint mode, the generic adjoint type (ga1s) is instantiated, allowing the definition of
Foam::scalar as dco::ga1s<double>::type. A chunk tape is used, allowing the tape to grow
gradually in chunks. Tape callbacks are needed for the symbolic differentiation of linear solvers.
Activity analysis is enabled, reducing the tape size for regions not dependent on the registered
inputs. For cases which require tape indices larger than 2^31, a 64-bit data type has to be
requested to enumerate the tape entries with -DDCO_TAPE_USE_LONG_INT.
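
The 2^31 limit can be illustrated with a small check. This is only a sketch: the helper name needs_long_tape_index is hypothetical and not part of dco/c++, and it assumes that the default tape index is a signed 32-bit integer.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of dco/c++): returns true if the number of
// tape entries exceeds what a signed 32-bit index can enumerate (2^31 - 1).
// Assumption: the default tape index type is a signed 32-bit integer.
bool needs_long_tape_index(std::uint64_t n_tape_entries)
{
    const std::uint64_t max_int32 =
        static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
    return n_tape_entries > max_int32;
}
```

If such a check returns true for the expected tape length, the 64-bit index flag described above would be the one to enable.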








export DOF_DCO_FLAGS="\
-DDCO_NO_DEFAULT \
-DOF_DCO_MODE_A1S \
-DOF_DCO_A1S_CONT_LINEAR \
-DDCO_GA1S \
-DDCO_CHUNK_TAPE \
-DDCO_TAPE_CALLBACKS \
-DDCO_TAPE_USE_LONG_INT \
-DDCO_TAPE_ACTIVITY \
-DDCO_ALLOW_TAPE_SWITCH_OFF"


A.2.3 First Order Tangent Scalar Mode (T1S) 


For scalar tangent mode, the generic tangent type (gt1s) is enabled, allowing the definition of
Foam::scalar as dco::gt1s<double>::type.


export DOF_DCO_FLAGS="\
-DOF_DCO_MODE_T1S \
-DDCO_NO_DEFAULT \
-DDCO_GT1S \
-DDCO_T1S_ACTIVITY"


A.2.4 First Order Tangent Vector Mode (T1V) 


For the vector tangent mode, the generic tangent type (gt1v) is enabled, allowing the definition
of Foam::scalar as dco::gt1v<double,d>::type. The vector size d is set to 5 and is fixed at
compile time for performance reasons. If another vector size is required, it can be changed in
DCO_T1V_SIZE. All object files have to be recompiled in order to apply the changes to the vector
size. The optimal vector size is dependent on cache behavior, the amount of available RAM and
the number of needed derivatives. A reference to the i-th tangent of a variable x can be accessed
by using the dco::derivative(x)[i] method.
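
As a rough guide to choosing d, the number of tangent sweeps needed for a given number of derivatives can be estimated by ceiling division. The function below is an illustrative sketch; the name tangent_sweeps is an assumption and does not belong to dco/c++ or OpenFOAM.

```cpp
#include <cassert>
#include <cstddef>

// Number of vector tangent sweeps needed to obtain n_derivatives directional
// derivatives when each sweep propagates vec_size tangents at once
// (ceiling division). Hypothetical helper for illustration only.
std::size_t tangent_sweeps(std::size_t n_derivatives, std::size_t vec_size)
{
    assert(vec_size > 0);
    return (n_derivatives + vec_size - 1) / vec_size;
}
```

With the default d = 5, twelve derivatives would require three sweeps, the last of which leaves three tangent slots unused.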


export DOF_DCO_FLAGS="\ 
-DOF_DCO_MODE_T1V \ 
-DDCO_NO_DEFAULT \ 
-DDCO_GT1V \ 
-DDCO_T1V_ACTIVITY \
-DDCO_VECTOR_SIZE=5" 


A.2.5 Second Order Tangent Over Adjoint Mode (T2A1S) 


For the second order tangent over adjoint mode, both the generic adjoint and tangent types are
instantiated. Those types can be arbitrarily nested to obtain higher order derivative models. The
scalar tangent over adjoint mode is obtained by nesting a tangent type inside an adjoint type,
yielding dco::ga1s<dco::gt1s<double>::type>::type.


export DOF_DCO_FLAGS="\ 
-DOF_DCO_MODE_T2A1S \ 
-DDCO_NO_DEFAULT \ 
-DOF_DCO_A1S_CONT_LINEAR \ 
-DDCO_GT1S \
-DDCO_GA1S \ 
-DDCO_TAPE_CALLBACKS \ 
-DDCO_TAPE_USE_LONG_INT \ 
-DDCO_TAPE_ACTIVITY \ 
-DDCO_ALLOW_TAPE_SWITCH_OFF" 
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A.2.6 Second Order Tangent Over Tangent Mode (T2T1S) 


The (scalar) second order tangent over tangent mode nests a tangent type inside another tangent
type, yielding dco::gt1s<dco::gt1s<double>::type>::type. The flags are identical to first
order tangent scalar mode.
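
The nesting idea can be sketched with a minimal hand-written tangent type. The struct Tangent and the functions below are illustrative stand-ins written for this example; they are not the dco/c++ types or API. Seeding both the inner and the outer tangent with 1 yields the second derivative in the innermost tangent component of the result.

```cpp
#include <cassert>
#include <cmath>

// Minimal forward-mode (tangent) type: value v and tangent t.
// Illustrative stand-in for a generic tangent type, NOT the dco/c++ API.
template <class T>
struct Tangent
{
    T v; // value
    T t; // tangent (directional derivative)
};

template <class T>
Tangent<T> operator+(const Tangent<T>& a, const Tangent<T>& b)
{
    return {a.v + b.v, a.t + b.t}; // sum rule
}

template <class T>
Tangent<T> operator*(const Tangent<T>& a, const Tangent<T>& b)
{
    return {a.v * b.v, a.t * b.v + a.v * b.t}; // product rule
}

// f(x) = x^3, written generically so it works for any nesting depth.
template <class T>
T cube(const T& x)
{
    return x * x * x;
}

// Second derivative of cube at x0 via tangent-over-tangent nesting.
double second_derivative_of_cube(double x0)
{
    using T1 = Tangent<double>;
    using T2 = Tangent<T1>;
    // Seed the inner tangent (x.v.t) and the outer tangent (x.t.v) with 1.
    T2 x{T1{x0, 1.0}, T1{1.0, 0.0}};
    T2 y = cube(x);
    return y.t.t; // d^2 f / dx^2 = 6 * x0
}
```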


export DOF_DCO_FLAGS="\
-DOF_DCO_MODE_T2T1S \
-DDCO_NO_DEFAULT \
-DDCO_GT1S \
-DDCO_T1S_ACTIVITY"
template <>
Foam::solverPerformance Foam::fvMatrix<Foam::scalar>::solveSegregated
(
    const dictionary& solverControls
)
{
    GeometricField<scalar, fvPatchField, volMesh>& psi =
        const_cast<GeometricField<scalar, fvPatchField, volMesh>&>(psi_);

    scalarField saveDiag(diag());
    addBoundaryDiag(diag(), 0);
    scalarField totalSource(source_);
    addBoundarySource(totalSource, false);

    ADmode::global_tape->switch_to_passive();

    auto* D = ADmode::global_tape->create_callback_object<ADmode::
        external_adjoint_object_t>();

    int nu = 0; if(this->hasUpper()) nu = this->upper().size();
    int nl = 0; if(this->hasLower()) nl = this->lower().size();
    int nd = this->diag().size();

    // register rhs as adjoint inputs
    forAll(totalSource, i)
        D->register_input(totalSource[i]);

    // register matrix coefficients as adjoint inputs
    if(this->hasUpper())
        for(int i = 0; i < nu; i++)
            D->register_input(this->upper()[i]);

    if(this->hasLower())
        for(int i = 0; i < nl; i++)
            D->register_input(this->lower()[i]);

    for(int i = 0; i < nd; i++)
        D->register_input(this->diag()[i]);

    // register boundary coefficients if on parallel boundary
    forAll(psi.boundaryField(), i)
        if(psi.boundaryField().types()[i] == "processor")
            forAll(boundaryCoeffs_[i], j)
                D->register_input(boundaryCoeffs_[i][j]);

    solverPerformance solverPerf = lduMatrix::solver::New
    (
        psi.name(),
        *this,
        boundaryCoeffs_,
        internalCoeffs_,
        psi_.boundaryField().scalarInterfaces(),
        solverControls
    )->solve(psi.primitiveFieldRef(), totalSource);

    D->write_data(psi.name());
    D->write_data(Foam::direction(-1)); // dummy direction
    D->write_data(*this);
    D->write_data(psi);

    forAll(psi.primitiveField(), i)
        psi.primitiveFieldRef()[i] = D->register_output(dco::passive_value((psi.
            primitiveFieldRef()[i])));

    ADmode::global_tape->insert_callback<ADmode::external_adjoint_object_t>(
        symbolic::fillSolverGap<Foam::scalar>, D);
    ADmode::global_tape->switch_to_active();

    diag() = saveDiag;
    psi.correctBoundaryConditions();
}

Listing B.1: fvMatrix solve with creation of solver gap and checkpoints of the necessary data. 
template<class Type>
void fillSolverGap(typename Foam::ADmode::external_adjoint_object_t *D){
    const Foam::word& fieldName = D->read_data<Foam::word>();
    const Foam::direction& cmpt = D->read_data<Foam::direction>();
    const Foam::fvMatrix<Type>& A = D->read_data<Foam::fvMatrix<Type> >();
    const Foam::volScalarField& x_ref = D->read_data<Foam::volScalarField>();
    const Foam::scalarField& x = x_ref.primitiveField();

    // read incoming adjoints from tape
    Foam::scalarField a1_x(x);
    forAll(a1_x, i)
        a1_x[i] = D->get_output_adjoint();

    Foam::fvMatrix<Type> A_T(A); // will hold transpose of A
    Foam::label nu = 0; if(A.hasUpper()) nu = A.upper().size();
    Foam::label nl = 0; if(A.hasLower()) nl = A.lower().size();
    Foam::label nd = A.diag().size();

    // component will return this for scalar field
    Foam::FieldField<Foam::Field,Foam::scalar> bcmpts = A_T.boundaryCoeffs().
        component(cmpt)();
    Foam::FieldField<Foam::Field,Foam::scalar> icmpts = A_T.internalCoeffs().
        component(cmpt)();

    // transpose matrix if necessary
    bool sym = A.symmetric();
    if(!sym){
        A_T.lower() = A.upper();
        A_T.upper() = A.lower();

        // switch boundary and internal coeffs for transposed
        icmpts = A_T.boundaryCoeffs().component(cmpt)();
        bcmpts = A_T.internalCoeffs().component(cmpt)();
    }
    Foam::word reverseFieldName = fieldName + Foam::word("Reverse");
    Foam::volScalarField a1_b(reverseFieldName, x_ref);
    const dictionary& reverseSolverControls = a1_b.mesh().solverDict([...]);
    Foam::lduMatrix::solver solver = Foam::lduMatrix::solver::New
    (
        fieldNameCmpt + Foam::word("Reverse"),
        A_T,
        bcmpts,
        icmpts,
        a1_b.boundaryField().scalarInterfaces(),
        reverseSolverControls
    );

    // solve for b1
    Foam::solverPerformance solverPerf
        = solver->solve(a1_b.primitiveFieldRef(), a1_x);
    [...] // continued in next listing


Listing B.2: Adjoint callback routine, solution of the adjoint system. 
    [...] // continuation of previous listing
    // increment input adjoint for b
    for(int i = 0; i < a1_b.size(); i++)
        D->increment_input_adjoint(dco::passive_value(a1_b[i]));

    // Addressing for upper and lower half of matrix
    const Foam::label* uPtr = A_T.lduAddr().upperAddr().begin();
    const Foam::label* lPtr = A_T.lduAddr().lowerAddr().begin();

    // increment input adjoint for A (upper part)
    double tmp = 0;
    if(A.hasUpper()){
        for(int i = 0; i < nu; i++){
            tmp = dco::passive_value(-a1_b[lPtr[i]]*x[uPtr[i]]);
            if(lPtr[i] != uPtr[i] && sym)
                tmp += dco::passive_value(-a1_b[uPtr[i]]*x[lPtr[i]]);
            D->increment_input_adjoint(tmp);
        }
    }
    // increment input adjoint for A (lower part)
    if(A.hasLower()){
        for(int i = 0; i < nl; i++){
            tmp = dco::passive_value(-a1_b[uPtr[i]]*x[lPtr[i]]);
            if(lPtr[i] != uPtr[i] && sym)
                tmp += dco::passive_value(-a1_b[lPtr[i]]*x[uPtr[i]]);
            D->increment_input_adjoint(tmp);
        }
    }
    // increment input adjoint for A (diag part)
    for(int i = 0; i < nd; i++)
        D->increment_input_adjoint(-a1_b[i]*x[i]);

    Foam::volScalarField xSF(x_ref);
    forAll(a1_b.boundaryField(), i){
        if(a1_b.boundaryField().types()[i] == "processor"){
            a1_b.boundaryFieldRef()[i].initEvaluate();
            a1_b.boundaryFieldRef()[i].evaluate();
            xSF.boundaryFieldRef()[i].initEvaluate();
            xSF.boundaryFieldRef()[i].evaluate();

            forAll(A_T.boundaryCoeffs()[i], j){
                Foam::Field<Foam::scalar> x_other =
                    xSF.boundaryField()[i].patchNeighbourField()();
                Foam::Field<Foam::scalar> a1_b_this =
                    a1_b.boundaryField()[i].patchInternalField()();
                tmp = dco::passive_value(x_other[j]*a1_b_this[j]);
                D->increment_input_adjoint(tmp);
            }
        }
    }
}


Listing B.3: Continuation of adjoint callback routine, incrementation of input adjoints. 
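
The relations used in Listings B.2 and B.3 can be checked on a small dense system. For A x = b with incoming adjoint x_bar, the right-hand side adjoint solves A^T b_bar = x_bar, and the matrix adjoint is A_bar(i,j) = -b_bar(i) x(j). The following sketch uses a 2x2 matrix and Cramer's rule purely for illustration; all type and function names are assumptions and do not appear in the discrete adjoint OpenFOAM sources.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<std::array<double, 2>, 2>;

// Solve the dense 2x2 system M y = r by Cramer's rule (illustration only).
Vec2 solve2x2(const Mat2& M, const Vec2& r)
{
    const double det = M[0][0] * M[1][1] - M[0][1] * M[1][0];
    assert(std::fabs(det) > 0.0);
    return {(r[0] * M[1][1] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - r[0] * M[1][0]) / det};
}

struct LinearSolveAdjoints
{
    Vec2 b_bar; // adjoint of the right-hand side
    Mat2 A_bar; // adjoint of the matrix coefficients
};

// Symbolic adjoint of x = A^{-1} b, given the primal solution x and the
// incoming adjoint x_bar.
LinearSolveAdjoints adjoint_of_solve(const Mat2& A, const Vec2& x,
                                     const Vec2& x_bar)
{
    const Mat2 A_T = {{{A[0][0], A[1][0]}, {A[0][1], A[1][1]}}};
    LinearSolveAdjoints adj;
    adj.b_bar = solve2x2(A_T, x_bar); // one additional solve with A^T
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            adj.A_bar[i][j] = -adj.b_bar[i] * x[j]; // outer product
    return adj;
}
```

The point of the symbolic treatment is that the iterations of the linear solver never need to be recorded on the tape; only one extra solve with the transposed matrix is required.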
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C Reference Solver and Case 


C.1 Reverse Accumulation for Incompressible Steady Flows 


/*
    GPLv3 License Header
*/
#include "fvCFD.H"
#include "singlePhaseTransportModel.H"
#include "turbulentTransportModel.H"
#include "simpleControl.H"
#include "fvOptions.H"

#include "CheckInterface.H"
#include "CheckDict.H"
#include "costFunctionLibrary.H"

int main(int argc, char *argv[])
{
    #include "setRootCase.H"
    #include "createTime.H"
    #include "createMesh.H"

    simpleControl simple(mesh);

    #include "createFields.H"
    #include "createFvOptions.H"
    #include "initContinuityErrs.H"

    #include "adjointSettings.H"

    CheckInterface check(runTime);
    CheckDict checkDict(&runTime);
    CheckDatabase checkDB(&runTime, &checkDict);

    turbulence->validate();

    dco::ga1s<double>::global_tape = dco::ga1s<double>::tape_t::create();
    auto& global_tape = dco::ga1s<double>::global_tape;

    auto zero_to = global_tape->get_position();
    auto interpret_to = global_tape->get_position();

    bool frozenTurbulence = false;
    scalar adjointEps = 1e-6;

    global_tape->switch_to_passive(); // iterate passive until primal convergence
    while(simple.loop()){
        // --- Pressure-velocity SIMPLE corrector
        #include "UEqn.H"
        #include "pEqn.H"
        laminarTransport.correct();
        turbulence->correct();

        if(simple.criteriaSatisfied())
            break;
    }

    // run and tape one active step
    global_tape->switch_to_active();
    global_tape->register_variable(alpha.begin(), alpha.end());

    zero_to = global_tape->get_position(); // don't zero alphas
    checkDB.registerAdjoints(); // register state
    interpret_to = global_tape->get_position();

    #include "UEqn.H"
    #include "pEqn.H"
    laminarTransport.correct();
    if(frozenTurbulence)
        global_tape->switch_to_passive();
    turbulence->correct();
    global_tape->switch_to_active();

    J = CostFunction(mesh).eval(); // eval cost, seed later
    checkDB.registerAsOutput();
    global_tape->switch_to_passive(); // no more recording needed

    int iter = 0;
    scalar eps = Foam::GREAT;
    while(eps > adjointEps)
    {
        if(iter == 0 && Pstream::master())
            dco::derivative(J) = 1;
        else
            checkDB.restoreAdjoints();

        global_tape->interpret_adjoint_to(interpret_to);
        checkDB.storeAdjoints();

        static scalar firstEps = checkDB.calcNormOfStoredAdjoints();
        eps = checkDB.calcNormOfStoredAdjoints() / firstEps; // normalize eps
        global_tape->zero_adjoints_to(zero_to);
        iter++;
    }
    forAll(alpha, i) // extract sensitivities after reverse acc converged
        sens[i] = dco::derivative(alpha[i]) / mesh.V()[i];
    runTime.write();
    dco::ga1s<double>::tape_t::remove(global_tape);
    return 0;
}

// ************************************************************************* //


Listing C.1: Source of reverseAccSimpleFoam 
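
The structure of Listing C.1 (converge the primal passively, tape a single step, then interpret that one step repeatedly while the parameter adjoints accumulate) can be illustrated on a scalar toy problem. Everything in this sketch is an assumption made for illustration: the update g(x, a) = 0.5 x + a, the cost J = x^2 and all names are invented and unrelated to the CFD solver.

```cpp
#include <cassert>
#include <cmath>

// Toy fixed-point problem (an assumption for illustration, not the solver):
//   state update   x <- g(x, a) = 0.5 * x + a   (converges to x* = 2a)
//   cost function  J(x) = x^2                   (so dJ/da = 8a)
struct ReverseAccResult
{
    double x;          // converged primal state
    double dJda;       // accumulated sensitivity
    int adjoint_iters; // number of adjoint interpretations
};

ReverseAccResult reverse_accumulation(double a, double adjoint_eps)
{
    // 1) iterate the primal passively until convergence
    double x = 0.0;
    for (int k = 0; k < 200; ++k)
        x = 0.5 * x + a;

    // 2) partial derivatives of one converged step (what the tape would hold)
    const double dg_dx = 0.5;
    const double dg_da = 1.0;
    const double dJ_dx = 2.0 * x;

    // 3) repeatedly interpret the single taped step; the parameter adjoint
    //    is never zeroed and therefore accumulates over the iterations
    double x_bar = dJ_dx; // seed from the cost function (first iteration only)
    const double first_norm = std::fabs(x_bar);
    assert(first_norm > 0.0); // the normalization below requires a != 0
    double a_bar = 0.0;
    double eps = 1.0;
    int iters = 0;
    while (eps > adjoint_eps && iters < 1000)
    {
        a_bar += dg_da * x_bar;              // contribution to dJ/da
        x_bar = dg_dx * x_bar;               // state adjoint through one step
        eps = std::fabs(x_bar) / first_norm; // normalized residual
        ++iters;
    }
    return {x, a_bar, iters};
}
```

The accumulated sum is a truncated Neumann series, so the loop converges only if the primal iteration is contractive at the fixed point, which is the case here because dg/dx = 0.5.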
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C.2 Example Case Configuration 


dimensions      [0 1 -1 0 0 0 0];
internalField   uniform (0 0 0);

boundaryField{
    inlet{
        type    fixedValue;
        value   uniform (100 0 0);
    }
    walls{
        type    fixedValue;
        value   uniform (0 0 0);
    }
    outlet{
        type    zeroGradient;
    }
    defaultFaces{
        type    empty;
    }
}


Listing C.2: Velocity boundary conditions 0/U 


dimensions      [0 2 -2 0 0 0 0];
internalField   uniform 0;

boundaryField{
    inlet{
        type    zeroGradient;
    }
    walls{
        type    zeroGradient;
    }
    outlet{
        type    fixedValue;
        value   uniform 0;
    }
    defaultFaces{
        type    empty;
    }
}


Listing C.3: Pressure boundary conditions 0/p 


transportModel  Newtonian;
nu              nu [0 2 -1 0 0 0 0] 1;


Listing C.4: Viscosity and transport model constant/transportProperties 


simulationType  laminar;

RAS
{
    RASModel        kEpsilon;
    turbulence      off;
    printCoeffs     on;
}


Listing C.5: Turbulence model constant/turbulenceProperties 


checkpointSettings{
    checkpointingMethod revolve; // equidistant, revolve or none
    nCheckpoints        10;
    nTapesteps          1;
}

checkpointRequired{
    U;
    p;
    phi;
}


Listing C.6: General settings system/checkpointingDict 


/*--------------------------------*- C++ -*----------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  plus                                  |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
FoamFile{ [...] }

application         simpleFoam;
startFrom           startTime;
startTime           0;
stopAt              endTime;
endTime             50;
deltaT              1;
writeControl        timeStep;
writeInterval       100;
purgeWrite          0;
writeFormat         ascii;
writePrecision      6;
writeCompression    off;
timeFormat          general;
timePrecision       6;
runTimeModifiable   true;

Listing C.7: system/controlDict 
/*--------------------------------*- C++ -*----------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  3.0.x                                 |
|   \\  /    A nd           | Web:      www.OpenFOAM.org                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
FoamFile{ [...] }

ddtSchemes{ default steadyState; }
gradSchemes{ default Gauss linear; }
divSchemes{
    default                         none;
    div(phi,U)                      bounded Gauss upwind;
    div((nuEff*dev2(T(grad(U)))))   Gauss linear;
}
laplacianSchemes{ default Gauss linear corrected; }
interpolationSchemes{ default linear; }
snGradSchemes{ default corrected; }
wallDist{ method meshWave; }


Listing C.8: system/fvSchemes 


/*--------------------------------*- C++ -*----------------------------------*\
| =========                 |                                                 |
| \\      /  F ield         | OpenFOAM: The Open Source CFD Toolbox           |
|  \\    /   O peration     | Version:  plus                                  |
|   \\  /    A nd           | Web:      www.OpenFOAM.com                      |
|    \\/     M anipulation  |                                                 |
\*---------------------------------------------------------------------------*/
FoamFile{ [...] }
// * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * //

SDLS yes; // global flag to enable or disable SDLS

solvers
{
    "(.*)" // catch all equations
    {
        solver      smoothSolver;
        smoother    symGaussSeidel;
        tolerance   1e-05;
        relTol      0;
        SDLS        $SDLS;
    }

    "(p|pReverse)" // specialize for pressure correction equation
    {
        solver      GAMG;
        tolerance   1e-06;
        relTol      0;
        smoother    DIC;
        SDLS        $SDLS;
    }
}
SIMPLE
{
    nNonOrthogonalCorrectors 0;
    consistent               yes;
    costFunctionPatches      (inlet);
    costFunction             "pressureLoss";
    adjointEps               1e-5; // threshold for reverse acc / piggyback
}
relaxationFactors
{
    equations
    {
        U   0.9; // 0.9 is more stable but 0.95 more convergent
        p   0.9; // 0.9 is more stable but 0.95 more convergent
    }
}

Listing C.9: system/fvSolution 



[Pe 2o2 esse ee ee ee lee Se eee ee eee sees ee 

| Seessscas | 

| \\ / eee nike) | OpenFOAM: 

ee a. ii O peration | Version: 17 7-1 

| Ne A nd | Web: 

| NW M anipulation | 

ee ee ee ee Se eae 

FoamFile{ [...] } 

eck ee tc ie cee ike ok 

convertToMeters 1; 

vertices ( 
(O 0 0) (2 0 0) (4 0 0) (5 0 0) (0 1 0) (2 1 
Sy al 0) eh tO) CO) CES 8) Cs 0) CS 
COO C2 Oa) 4 Oa) C5 Oe) en COM Le) C2 al 
Comte) eG2 = Sarl aC St CS a) C4 eso) Co 5 

); 

m4_define(lvl ,1) 

blocks ( 
hex (0 15 4 13 14 18 17) ( m4_eval (lvl1*2) 


hex 
hex 
hex 
hex 
hex 


oe 


simpleGrading (1 1 1) 
simpleGrading (1 1 1) 
simpleGrading (1 1 1) 
simpleGrading (1 1 1) 
simpleGrading (1 1 1) 


simpleGrading (1 1 1) 


(1265 14 15 19 18) 


CPE 16 5) i as 0 1) 


COyGeIoels 195225721) 


(657 210 9192072322) 


(9 10 12 11 22 23 25 24) ( 


patches ( 
patch inlet ( 


( 
) 
wal 

( 

( 

( 
) 


patch outlet ( 


0 


1 
0 
3 
5 


4 17 13) 


walls ( 

i abe aL 61) 
7 20 16) 
8 21 18) 


(1 2 15 14) 
(7 10 23 20) 
(8 9 22 21) 


(11 12 @5 24) 


) 
Pe 


C2 
(10 
Cs 


m4_eval(lv1l*2) 


m4_eval(lvl1*1) 


m4 _eval(lvl*2) 


m4 _eval(lvl*1) 


m4_eval(lvl*i) 


3 16 15) 
12) 257723) 
11 24 22) 


0) 
0) 
1) 
1) 




The Open Source CFD Toolbox 


www.OpenFOAM.com 


(4 1 0) 


(4 1%) 


m4 eval (lvl1l*1) 


m4_eval(lvl*1) 


m4_eval(lvl*1) 


m4 eval (lvl1l*2) 


m4_eval(lv1l*2) 


m4_eval(lv1l*2) 


(4 5 18 17) 


* ok ok Kk ok ok ok AP ek kk Kk Me ke Ok // 


Jf RO We ee ee oR oR ee ee ke ce ee ae 7 


Listing C.10: system/blockMeshDict 
#include <iostream>
#include <vector>
#include <cmath>

typedef std::vector<double> adjoint_vector;

struct partial_edge{
    partial_edge(double v=0, int i=-1) : partial_val(v), target_idx(i){}
    double partial_val;
    int target_idx;
};

struct adjoint_stack : std::vector<std::vector<partial_edge>>{
    int longest_edge;
    int n_params;
};

adjoint_stack s;

struct adjoint_type{
    double val;
    int tape_idx;
    adjoint_type(const double& x): val(x), tape_idx(-1) {}

    void operator=(const adjoint_type& x){
        this->val = x.val;
        std::vector<partial_edge> p(1,partial_edge(1.0,x.tape_idx));
        this->tape_idx = s.size();
        s.emplace_back(1,partial_edge(1.0,x.tape_idx));
    }
};

void register_variable(adjoint_type& adt){
    s.n_params++;
    s.emplace_back(); // push empty entry to tape
    adt.tape_idx = s.size()-1;
}


Listing D.1: Data structures for tape and partial edges; Interface. 


adjoint_type operator*(const adjoint_type& x1, const adjoint_type& x2){
    adjoint_type y(x1.val*x2.val);
    std::vector<partial_edge> partials;
    if(x1.tape_idx >= 0) partials.emplace_back(x2.val,x1.tape_idx);
    if(x2.tape_idx >= 0) partials.emplace_back(x1.val,x2.tape_idx);
    y.tape_idx = s.size();
    s.push_back(partials);
    return y;
}

adjoint_type operator+(const adjoint_type& x1, const adjoint_type& x2){
    adjoint_type y(x1.val+x2.val);
    std::vector<partial_edge> partials;
    if(x1.tape_idx >= 0) partials.emplace_back(1.0,x1.tape_idx);
    if(x2.tape_idx >= 0) partials.emplace_back(1.0,x2.tape_idx);
    y.tape_idx = s.size();
    s.push_back(partials);
    return y;
}

adjoint_type operator/(const adjoint_type& x1, const adjoint_type& x2){
    adjoint_type y(x1.val/x2.val);
    std::vector<partial_edge> partials;
    if(x1.tape_idx >= 0) partials.emplace_back(1.0/x2.val,x1.tape_idx);
    if(x2.tape_idx >= 0)
        partials.emplace_back(-x1.val/(x2.val*x2.val),x2.tape_idx);
    y.tape_idx = s.size();
    s.push_back(partials);
    return y;
}


Listing D.2: Operators 


int find_longest_edge(const adjoint_stack& s){
    int max_length = 0;
    for(int i = s.size()-1; i>=0; i--)
        for(const partial_edge& p : s[i])
            if(p.target_idx >= s.n_params)
                max_length = std::max(max_length, i-p.target_idx);
    return max_length;
}

void interpret_tape(const adjoint_stack& s, adjoint_vector& v){
    for(int i = s.size()-1; i>=0; i--)
        for(const partial_edge& p : s[i])
            v[p.target_idx] += p.partial_val * v[i];
}

void interpret_tape_mod(const adjoint_stack& s, adjoint_vector& v){
    const int nC = v.size();
    const int nP = s.n_params;
    const int l = s.longest_edge;
    for(int i = 0; i<(s.size()-nP); i++){
        const int ti = s.size()-1-i; // position in tape
        const int j = nC-1-(i % (l+1)); // position in mod adjoint vector
        std::vector<partial_edge> partial_edges = s[ti];

        for(partial_edge& p : partial_edges){
            if(p.target_idx >= 0){
                if(p.target_idx < nP){
                    v[p.target_idx] += p.partial_val * v[j];
                }else{
                    const int tgt = ((j-nP)-(ti-p.target_idx)+(l+1)) % (l+1) + nP;
                    v[tgt] += p.partial_val * v[j];
                }
            }
        }
        v[j] = 0; // reset adjoint after all tape edges adjoined
    }
}


Listing D.3: (Compressed) Tape interpretation 


int main(){
    adjoint_type a = 2;
    register_variable(a);
    adjoint_type x = a;
    for(int i = 0; i < 3; i++){
        x = 0.5*(a/x+x);
    }
    std::cout << "f(x) = " << x.val << std::endl;
    { // interpret compressed
        s.longest_edge = find_longest_edge(s);
        adjoint_vector av(s.n_params + s.longest_edge+1);
        av[av.size()-1] = 1.0; // seed
        interpret_tape_mod(s,av);
        std::cout << "df/da = " << av[a.tape_idx] << std::endl;
    }
    { // interpret full
        adjoint_vector av(s.size());
        av[av.size()-1] = 1.0; // seed
        interpret_tape(s,av);
        std::cout << "df/da = " << av[a.tape_idx] << std::endl;
    }
    return 0;
}

// output:
// f(x) = 1.41422
// df/da = 0.353566
// df/da = 0.353566

Listing D.4: Driver calculating derivative of approximation to √x using full and compressed
adjoint vector representations.
Aerodynamic Simulation of a 2017 F1 Car with Open-Source CFD Code


Umberto Ravelli and Marco Savini 
Abstract: Open-wheeled race car aerodynamics is unquestionably challenging insofar as it involves many physical phenomena, such
as slender and blunt body aerodynamics, ground effect, vortex management and interaction between different sophisticated aero
devices. In the current work, the aerodynamics of a 2017 F1 car has been investigated numerically using an
open-source code. The vehicle project was developed by PERRINN (Copyright©2011-Present PERRINN), an engineering
community founded by Nicolas Perrin in 2011. The racing car performance is quantitatively evaluated in terms of drag, downforce,
efficiency and front balance. The goals of the present CFD (computational fluid dynamics)-based research are the following:
analyzing the capabilities of the open-source software OpenFOAM in dealing with complex meshes and external aerodynamics
calculations, and developing a reliable workflow from the CAD (computer aided design) model to the post-processing of the results, in
order to meet production demands.


Key words: External aerodynamics, open-source CFD, 2017 F1 car, drag, downforce, efficiency, front balance. 


1. Introduction 


Nowadays CFD (computational fluid dynamics) and motorsport are closely connected and
interdependent: on the one hand, aerodynamic
simulations are crucial for designing and developing
increasingly fast vehicles; on the other hand, the
extreme pursuit of performance in motor racing is
the catalyst behind the development of sophisticated
and reliable numerical procedures and innovative
CAE (computer aided engineering) tools.

The impact of CFD on motorsport has grown in
tandem with computer hardware advances: looking at
the Formula 1 experience during the period from 1990
to 2010, simulations evolved from the inviscid panel
method to one-billion-cell calculations of entire cars,
including analysis of transient behaviour and
overtaking [1]. As witnessed by the Formula 1 team
Sauber Petronas, CFD technology is applied in
many stages of vehicle development: early concept
phase, system design (engine and brake cooling, brake
systems), single component design and complete
system design and interactions [2]. Although
open-wheeled racing car aerodynamics is a rather
unusual field of research, there are a few publications
from academia as well as private industry: as
an example of partnership between university and
motorsport teams, Zhang, Toet and Zerihan [3]
reviewed the progress made during the last 30 years
on ground effect aerodynamics.

From the point of view of the required
computational resources, the complexity of the
geometry and the resulting numerical issues, the
simulation of realistic open-wheeled cars is really
challenging. For this reason, F1 teams and researchers
often rely on commercial software that provides
user-friendliness, flexibility and reliability: the more
time is spent on pre-processing and debugging, the
less remains for design and for understanding the
physics. Examples of this type of study can be
found in Refs. [4] and [5], where the ANSYS software package
is used to investigate the impact of the 2009 FIA technical
regulations on the aerodynamic performance of F1
cars.
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In addition to CFD commercial solutions, there are 
open-source codes able to execute both the meshing
phase and the fluid dynamic calculation: one of the
most popular is OpenFOAM®. This free-license tool
is successfully used and developed by academic
researchers [6] and automotive industries [7] in order
to predict the aerodynamic performance of road cars;
however, due to some critical issues connected to
meshing accuracy and numerical stability, it is not
widespread in high-level motorsport applications. In
the current study, the highly complex aerodynamics of
a 2017 F1 car has been numerically investigated with
OpenFOAM®. Prediction reliability has been tested
against reference data of drag, downforce, efficiency
and front balance, provided by PERRINN.


2. CFD Workflow: Software and Hardware 
Tools 


The first step of this study consists of developing a
standard and reliable CFD workflow (from meshing to
calculation) for external aerodynamic analysis of very
complex geometries, through the use of the
open-source software OpenFOAM. Many degrees of
freedom are available in the case setup: the user can
decide time and space discretization schemes of the
Navier-Stokes equation terms as well as the solver for
each variable [8].

The volume mesh is generated by snappyHexMesh,
an OpenFOAM utility providing a non-graphical, fast
and flexible procedure for every kind of geometry,
especially in external aerodynamics applications. This
implies a huge saving of time and resources in
comparison with traditional meshing software
without a batch mesh utility. On the other hand, it is
less accurate than some commercial meshing software,
for instance in adding layers and tracing the edges of
complex surfaces [8].

Both the meshing and the calculations were carried
out using Galileo, the Italian Tier-1 cluster for
industrial and public research, available at CINECA
SCAI (supercomputing applications and innovations).
The meshing processes were executed by means of 6
computational nodes, each of which is composed of
16 cores (8 GB/core); the calculations were instead
performed using 14 nodes. About 1,500 iterations
were required to get convergence on the basis of


residuals lower than 10%. 


3. Pre-processing and Numerical Setup 
3.1 Geometry 


The input file of the geometry must be in
STereoLithography format (.stl). Many commercial
CAD software packages are able to convert the original model
into this format, but it is preferable to use only those
providing detailed control over the output file, since
the quality of the .stl model is directly connected to the
quality of the volume mesh and the accuracy of the
final fluid dynamic results.

A final check of the .stl file is recommended in
order to control orientation, closure of the surfaces,
quality of triangles and edges: Netfabb Basic, a free
software, was used in the current study.

The .stl file of the F1 car, obtained from the original
project by PERRINN (Fig. 1), contains a lot of
interesting features and challenges from the
perspective of meshing. The full-scale F1 car model,
whose wheelbase (WB) is 3.475 m long, presents
many small realistic details such as winglets, fences,
vortex generators and slots: the smallest elements are
1.5 mm thick.

Proximity problems can be found among the
suspension arms, the front wing flaps and between the
underbody and the ground: with the baseline setup
(front ride height = 20 mm; rear ride height = 50 mm),
the minimum distance between the plank and the
ground is 13 mm. A contact patch between the tires
and the ground, established by the front and rear ride
height of the vehicle, needs to be defined in order to
avoid problems of cell skewness.

Before starting the meshing phase with
snappyHexMesh, the car model is divided into
components, so as to analyse separately the behaviour
of each part of the car body (Fig. 2).

Fig. 1 Rendering of the 2017 F1 car by PERRINN (image from gpupdate.net).

Fig. 2 Geometry in stl format: (a) top view; (b) bottom view.

3.2 Mesh and Simulation Setup


The domain length is about 18 times the WB of the
vehicle: the distance between the inlet and the front
axle is about 4.6 times the WB, while the outlet of the
virtual tunnel, where the atmospheric pressure is
imposed, is located well downstream of the car, i.e.
13.8 times the WB (Fig. 3). Since the simulation is
steady and the vehicle is perfectly symmetrical, only
half of the car is taken into account: the distance
between the longitudinal symmetry plane and the
sidewall is about 16 times the half-width of the car.
The height of the domain is 16 times the height (h) of
the vehicle. A slip condition is imposed on the side
wall and the ceiling of the wind tunnel, while the
ground moves at the same speed imposed at the inlet,
for the purpose of comparison with the reference
calculation made by PERRINN. The angular velocity
and rotational axis of the wheels need to be defined.
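
As a rough illustration of the sizing rules above, the following Python sketch computes the domain extents and the wheel angular velocity. The wheelbase and the freestream speed come from the paper; the car half-width, car height and wheel radius are hypothetical placeholders, and the sketch assumes that the 13.8 WB downstream distance is measured from the front axle and that the tires roll without slip on the moving ground.

```python
from dataclasses import dataclass


@dataclass
class TunnelDomain:
    inlet_to_front_axle: float   # m
    front_axle_to_outlet: float  # m (assumed to start at the front axle)
    symmetry_to_sidewall: float  # m
    height: float                # m

    @property
    def length(self):
        return self.inlet_to_front_axle + self.front_axle_to_outlet


def size_domain(wheelbase, car_half_width, car_height):
    """Apply the multiples quoted in the text to the car dimensions."""
    return TunnelDomain(
        inlet_to_front_axle=4.6 * wheelbase,
        front_axle_to_outlet=13.8 * wheelbase,
        symmetry_to_sidewall=16.0 * car_half_width,
        height=16.0 * car_height,
    )


def wheel_angular_velocity(ground_speed, wheel_radius):
    """Rolling without slip: omega = U / R, in rad/s."""
    return ground_speed / wheel_radius


WB = 3.475             # m, from the paper
U_INF = 50.0           # m/s, from the paper
CAR_HALF_WIDTH = 1.0   # m, hypothetical placeholder
CAR_HEIGHT = 0.95      # m, hypothetical placeholder
WHEEL_RADIUS = 0.33    # m, hypothetical placeholder

domain = size_domain(WB, CAR_HALF_WIDTH, CAR_HEIGHT)
omega = wheel_angular_velocity(U_INF, WHEEL_RADIUS)
print(f"domain length = {domain.length:.1f} m ({domain.length / WB:.1f} WB)")
print(f"wheel angular velocity = {omega:.1f} rad/s")
```

With these multiples the total length is 18.4 WB, in line with the "about 18 times" quoted in the text.
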

The main features of the mesh are as follows: the
height of the first cell at all solid surfaces is 0.6 mm
and the layer expansion ratio is 1.2. The resulting
average value of y+ is about 40: this value requires
the use of wall functions, as is currently done in
industrial applications.
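
The order of magnitude of y+ can be cross-checked with a back-of-the-envelope estimate. The Python sketch below is not the authors' procedure: it assumes a flat-plate turbulent skin-friction correlation, evaluates y+ at the centre of the first cell, and back-calculates the kinematic viscosity from the freestream velocity, wheelbase and Reynolds number of Table 1. The number of layers used for the stack thickness is a hypothetical value.

```python
import math


def estimate_y_plus(u_inf, ref_length, nu, first_cell_height):
    """Flat-plate estimate of y+ at the first cell centre (assumption)."""
    reynolds = u_inf * ref_length / nu
    # Turbulent flat-plate correlation Cf = 0.026 / Re^(1/7) (assumption).
    skin_friction = 0.026 / reynolds ** (1.0 / 7.0)
    friction_velocity = u_inf * math.sqrt(skin_friction / 2.0)
    y_centre = 0.5 * first_cell_height
    return y_centre * friction_velocity / nu


def layer_stack_thickness(first_cell_height, expansion_ratio, n_layers):
    """Total thickness of a geometrically growing stack of layers."""
    return (first_cell_height
            * (expansion_ratio ** n_layers - 1.0)
            / (expansion_ratio - 1.0))


U_INF = 50.0                   # m/s, Table 1
WB = 3.475                     # m, Table 1
NU = U_INF * WB / 12.0e6       # m^2/s, back-calculated from Re = 12e6
FIRST_CELL_HEIGHT = 0.6e-3     # m, from the text
N_LAYERS = 5                   # hypothetical, not given in the text

y_plus = estimate_y_plus(U_INF, WB, NU, FIRST_CELL_HEIGHT)
stack = layer_stack_thickness(FIRST_CELL_HEIGHT, 1.2, N_LAYERS)
print(f"estimated y+ = {y_plus:.0f}")
print(f"layer stack thickness = {stack * 1e3:.2f} mm")
```

Under these assumptions the estimate falls in the same range as the average value of about 40 reported above, well beyond the viscous sublayer, which is why wall functions are needed.
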

Due to the complexity of the geometry and the
related physical phenomena, many refinement boxes
need to be defined. Special attention must be given to
the huge wake region and the parts responsible for
downforce: the multi-component ground effect front
wing, the rear wing composed of a high-cambered
main plane and a high angle-of-attack flap, and finally
the underbody, where the flow reaches its highest
velocity.

As suggested by a preliminary study [8], two
turbulence models were taken into account: the
kOmegaSST (kωSST) and the SpalartAllmaras (S-A).
Physical data needed to define the numerical setup
and initialize the turbulent variables of both
simulations are summarized in Table 1. The car WB,
representing the size of the largest eddy, was chosen
as the turbulent length scale. The incompressible RANS
simulations were performed with the coupled version
of the simpleFoam algorithm, which is faster and more
stable than the segregated one, at the cost of more
computational resources. The GAMG (geometric
algebraic multi grid) solver was used for the pressure
equation, whilst smoothSolver was applied for
velocity and turbulent variables. The entire calculation
was executed with 2nd order discretization schemes.
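
The initialization of the turbulent variables from a turbulence intensity and a length scale can be sketched as follows. The paper does not state which relations were used, so this Python example assumes the commonly used free-stream estimates k = 3/2 (U I)^2 and omega = sqrt(k) / (C_mu^0.25 L), with C_mu = 0.09; it only covers the k and omega fields of the kOmegaSST model, and the function names are hypothetical.

```python
import math

C_MU = 0.09  # standard model constant (assumption)


def turbulent_kinetic_energy(u_inf, intensity):
    """k = 3/2 (U I)^2, with the intensity given as a fraction."""
    return 1.5 * (u_inf * intensity) ** 2


def specific_dissipation_rate(k, length_scale):
    """omega = sqrt(k) / (C_mu^0.25 L)."""
    return math.sqrt(k) / (C_MU ** 0.25 * length_scale)


U_INF = 50.0         # m/s, Table 1
INTENSITY = 0.0015   # 0.15 %, Table 1
LENGTH_SCALE = 3.475 # m, wheelbase chosen as turbulent length scale

k = turbulent_kinetic_energy(U_INF, INTENSITY)
omega = specific_dissipation_rate(k, LENGTH_SCALE)
print(f"k = {k:.6f} m^2/s^2")
print(f"omega = {omega:.5f} 1/s")
```
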
Convergence was considered to be reached whenever 
the scaled pressure and velocity residuals were lower 
than 10° and the aerodynamic coefficients remained 
stable (± 1% in the last 500 iterations).
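
A convergence test of this kind can be sketched in a few lines of Python. The interpretation of "stable within 1%" as the maximum deviation from the mean over the last 500 iterations is an assumption, the residual threshold is passed in as a parameter, and the value 1e-5 in the usage lines is purely illustrative.

```python
def coefficient_is_stable(history, window=500, tolerance=0.01):
    """True if the last `window` samples stay within +/- tolerance of their mean."""
    if len(history) < window:
        return False
    recent = history[-window:]
    mean = sum(recent) / window
    if mean == 0.0:
        return False
    return max(abs(v - mean) for v in recent) <= tolerance * abs(mean)


def is_converged(residuals, coefficient_histories, residual_target,
                 window=500, tolerance=0.01):
    """Combine the residual threshold with the coefficient stability check."""
    residuals_ok = all(r < residual_target for r in residuals.values())
    coefficients_ok = all(
        coefficient_is_stable(h, window, tolerance)
        for h in coefficient_histories.values()
    )
    return residuals_ok and coefficients_ok


# Synthetic histories: one oscillating by 0.4 %, one drifting by about 10 %.
steady = [1.16 * (1.0 + 0.004 * (-1) ** i) for i in range(500)]
drifting = [1.16 * (1.0 + 0.0002 * i) for i in range(500)]

print(coefficient_is_stable(steady))
print(coefficient_is_stable(drifting))
print(is_converged({"p": 5e-6, "U": 2e-6}, {"SCx": steady},
                   residual_target=1e-5))
```
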

Three different meshes were tested (140, 120 and 90
million cells): since the results in terms of global
performance did not change significantly, the coarsest
one was chosen for the research.

4. Results and Discussion


The comparison between numerical predictions and
reference data from the PERRINN database deals with
drag (SCx), downforce (SCz), aerodynamic efficiency
(Cz/Cx) and front balance (FB), where S is the frontal
area of the car and FB is the ratio between the
downforce on the front axle and the total downforce.

The kωSST turbulence model predicted a premature
separation of the flow on the suction side of the wings
and along the diffuser: as a result, the downforce and
drag coefficients were underestimated by 20% and
11%, respectively.

By contrast, S-A showed a better behaviour for
boundary layers in adverse pressure gradients [8]: as
summarized in Table 2, the results of the coupled
RANS simulation with the S-A model are consistent
with the reference data. The percentage errors in the
prediction of drag and downforce are 6% and 7%,
respectively, whilst the front balance coefficient
differs by 10% from the reference datum. After proper
validation, Table 3 summarizes the contribution of the
main vehicle components to the vertical load (SCz).
The bottom of the car, composed of the underbody
and the plank, generates about 58% of the overall
downforce, whilst the front and the rear wing provide
26.3% and 27.5% of the total, respectively. The front
bodywork also has a beneficial effect in terms of
downforce; on the contrary, the sidepod and upper
bodywork generate undesirable lift because they
deflect the flow downwards.

Concerning SCx, one can see in Table 4 that the
wheels are responsible for approximately 30% of
total drag.

The underbody is the most efficient aerodynamic
device, because it makes extensive use of ground
effect and Venturi effect to generate downforce, in
contrast to the rear wing. Despite the complexity of
the suspension geometry, its contribution to drag is
only 3%, owing to the fact that the arm sections are
streamlined like a wing profile.

Fig. 3 Volume mesh: (a) symmetry plane; (b) details of the refinement boxes around the car.

Table 1 Physical conditions of the simulation.

Variable                         Value
Freestream velocity (u)          50 m/s
Air density (ρ)                  1.225 kg/m³
Turbulent intensity (I)          0.15%
Wheelbase of the vehicle (WB)    3.475 m
Reynolds number (Re_WB)          12×10⁶

Table 2 Comparison between numerical results and reference data (S-A model).

                  SCx [m²]   SCz [m²]   Cz/Cx   FB
Reference data    1.23       -3.59      2.92    0.448
S-A results       1.16       -3.35      2.89    0.403
Error %           5.7        6.7        1.0     10.0

Table 3 Contribution to downforce of the main components of the car.

Component           SCz [m²]   Contribution (%)
Cockpit             +0.04      +1.2
Driver              +0.02      +0.6
Front bodywork      -0.10      -3.0
Front suspension    +0.06      +1.8
Front wing          -0.88      -26.3
Plank               -0.45      -13.4
Rear suspension     -0.01      -0.3
Rear wing           -0.92      -27.5
Sidepod             +0.2       +6.0
Underbody           -1.49      -44.5
Upper bodywork      +0.18      +5.4
Full car            -3.35      -100

Table 4 Contribution to drag of the main components of the car.

Component           SCx [m²]   Contribution (%)
Front tire          +0.115     +9.9
Rear tire           +0.226     +19.5
Front bodywork      +0.055     +4.7
Front suspension    +0.012     +1.0
Front wing          +0.159     +13.7
Rear suspension     +0.024     +2.1
Rear wing           +0.235     +20.3
Sidepod             +0.035     +3.0
Underbody           +0.181     +15.6
Upper bodywork      +0.021     +1.8
Bargeboard          +0.047     +4.1
Other parts         +0.05      +4.3
Full car            +1.16      +100
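
The percentages quoted in Tables 2 to 4 follow directly from the tabulated coefficients, and a short Python sketch makes the arithmetic explicit. The helper names are hypothetical; the numbers are copied from the tables, and the sign convention (negative values for downforce) is kept when computing the contributions.

```python
def percent_error(reference, predicted):
    """Absolute relative error with respect to the reference, in percent."""
    return abs(predicted - reference) / abs(reference) * 100.0


def percent_contributions(parts):
    """Share of each part in the total, keeping the sign of each part."""
    total = sum(parts.values())
    return {name: value / abs(total) * 100.0 for name, value in parts.items()}


# Table 2: reference data against S-A results.
reference = {"SCx": 1.23, "SCz": -3.59, "FB": 0.448}
s_a = {"SCx": 1.16, "SCz": -3.35, "FB": 0.403}
errors = {key: round(percent_error(reference[key], s_a[key]), 1)
          for key in reference}

# Table 3: component contributions to SCz [m^2].
downforce = {
    "Cockpit": 0.04, "Driver": 0.02, "Front bodywork": -0.10,
    "Front suspension": 0.06, "Front wing": -0.88, "Plank": -0.45,
    "Rear suspension": -0.01, "Rear wing": -0.92, "Sidepod": 0.20,
    "Underbody": -1.49, "Upper bodywork": 0.18,
}
# Table 4: component contributions to SCx [m^2].
drag = {
    "Front tire": 0.115, "Rear tire": 0.226, "Front bodywork": 0.055,
    "Front suspension": 0.012, "Front wing": 0.159,
    "Rear suspension": 0.024, "Rear wing": 0.235, "Sidepod": 0.035,
    "Underbody": 0.181, "Upper bodywork": 0.021, "Bargeboard": 0.047,
    "Other parts": 0.05,
}

downforce_share = percent_contributions(downforce)
drag_share = percent_contributions(drag)
bottom_share = downforce_share["Underbody"] + downforce_share["Plank"]
wheel_share = drag_share["Front tire"] + drag_share["Rear tire"]

print(errors)                  # {'SCx': 5.7, 'SCz': 6.7, 'FB': 10.0}
print(round(bottom_share, 1))  # underbody plus plank, about -58 %
print(round(wheel_share, 1))   # front plus rear tires, about 30 %
```
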


Fig. 4 illustrates the pressure coefficient (Cp) on the
surface of the car. The bottom of the bodywork is
characterized by typical low-pressure cores which are
located at the beginning of the plank, where the
ground clearance is smallest, and at the entrance of
the underbody and rear diffuser. In close proximity to
the rear tire disturbance, the pressure increases and
the ground effect benefits are lost. The upper view
shows the contribution to downforce of the front
bodywork, due to the shape of the nose cone and the
stagnation area in front of the cockpit. As regards the
wings, it can be noted that the rear wing generates
downforce mainly due to the high camber of the
airfoil design; on the contrary, the front wing makes
use of ground effect to accelerate the flow on the
suction side.
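
For reference, the pressure coefficient plotted in such maps is usually defined as the local pressure relative to the freestream value, normalised by the freestream dynamic pressure. The paper does not spell out the definition, so the Python sketch below assumes this standard form together with the density and velocity of Table 1; the sample pressure is a made-up value.

```python
def pressure_coefficient(p, p_inf, rho, u_inf):
    """Cp = (p - p_inf) / (0.5 * rho * u_inf^2)."""
    return (p - p_inf) / (0.5 * rho * u_inf ** 2)


RHO = 1.225   # kg/m^3, Table 1
U_INF = 50.0  # m/s, Table 1

dynamic_pressure = 0.5 * RHO * U_INF ** 2  # Pa
# Gauge pressures (p_inf = 0): a stagnation point and a hypothetical suction value.
cp_stagnation = pressure_coefficient(dynamic_pressure, 0.0, RHO, U_INF)
cp_suction = pressure_coefficient(-3000.0, 0.0, RHO, U_INF)
print(f"dynamic pressure = {dynamic_pressure:.2f} Pa")
print(f"Cp at stagnation = {cp_stagnation:.2f}")
print(f"Cp for -3000 Pa gauge = {cp_suction:.2f}")
```
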

Both generation of downforce and induced drag are
strictly connected with the management of axial
vorticity: an overall view of these three-dimensional
rotational structures can be identified by the
iso-surface of the scalar Q [1/s²]: this variable, defined
as the second invariant of the velocity gradient tensor,
allows detection of the regions where the Euclidean
norm of the vorticity tensor prevails over that of the
rate of strain [9].
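
The definition above can be written as Q = 1/2 (||Ω||² − ||S||²), where S and Ω are the symmetric and antisymmetric parts of the velocity gradient. The following Python sketch evaluates it for a single 3×3 velocity gradient given as nested lists; the function name and the test flows are illustrative and not taken from the paper.

```python
def q_criterion(grad_u):
    """Q for one point, with grad_u[i][j] = d u_i / d x_j in 1/s.

    Returns Q in 1/s^2: positive where rotation dominates strain.
    """
    norm_rotation_sq = 0.0
    norm_strain_sq = 0.0
    for i in range(3):
        for j in range(3):
            strain = 0.5 * (grad_u[i][j] + grad_u[j][i])
            rotation = 0.5 * (grad_u[i][j] - grad_u[j][i])
            norm_strain_sq += strain * strain
            norm_rotation_sq += rotation * rotation
    return 0.5 * (norm_rotation_sq - norm_strain_sq)


# Solid-body rotation about z with angular rate w = 2 1/s.
rotation_flow = [[0.0, -2.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# Pure straining flow u = (a x, -a y, 0) with a = 3 1/s.
strain_flow = [[3.0, 0.0, 0.0], [0.0, -3.0, 0.0], [0.0, 0.0, 0.0]]
# Simple shear u = (g y, 0, 0) with g = 4 1/s.
shear_flow = [[0.0, 4.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

print(q_criterion(rotation_flow))  # 4.0
print(q_criterion(strain_flow))    # -9.0
print(q_criterion(shear_flow))     # 0.0
```

An iso-surface such as the one at Q = 50,000 1/s² in Fig. 5 then collects the points where this scalar equals the chosen positive threshold.
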

Fig. 4 Pressure coefficient contours: (a) top view; (b) bottom view.

Fig. 5 Iso-contour of Q = 50,000 1/s²: (a) top view; (b) bottom view.

As illustrated in Fig. 5, the front wing generates not
only downforce but also vortices that bypass the front
tires. The bargeboard, apart from shielding the
underbody from the tire wake, generates a pair of
counter-rotating vortices: the upper one travels down
the sidepod and acts as an aerodynamic skirt, sealing
the low-pressure area under the underbody; the lower
vortex energizes the flow at the bottom of the car.
Looking at the rear part of the vehicle, one can see
the Venturi vortex developing at the inlet of the
diffuser due to the difference of pressure between the
underbody and the region at the side of the diffuser itself.
Finally, the typical wingtip vortices detach from the
rear wing, despite the presence of endplates and slots.
Further details about the vortical structures around
the single components of the vehicle bodywork can be
examined by plotting axial vorticity contours. Fig. 6a
clearly shows the presence of the so-called Y250
vortex: it develops between the neutral middle section
and the remaining profiles of the front wing and
governs the flow towards the underbody inlet. The
outwash endplate and the channel underneath the end
of the wing (called Venturi channel) generate a pair of
vortices, with negative vorticity, which help the flow
to bypass the front tire. The interaction between the
front wing and the front tire represents a really
complex phenomenon due to the unsteadiness of the
flow and the influence of the tire alignment
parameters (camber, steering angle and toe angle) on
vortex development. Fig. 6b shows three main vortex
cores: as mentioned before, two of them are related to
the bargeboard; the third one, located between the
bargeboard itself and the plank, is generated by the
delta-shaped part of the underbody. As witnessed by
Fig. 6c, the vortex tubes develop along the entire step
plane: the low pressure core of these fluid dynamic
structures contributes to the generation of downforce,
in the absence of side skirts that isolate the underbody
flow.

Concerning the rear region of the car (Fig. 6d), one
can see the two vortices generated by the diffuser
fences alongside the main Venturi vortex.
Looking at the flow underneath the car, each main 
vortex is coupled with a secondary structure, because 
of its interaction with the ground boundary layer. 

Fig. 6 Axial vorticity contours: (a) front wing; (b) bargeboard; (c) middle underbody; (d) diffuser.

Fig. 7 Streamlines colored by axial vorticity: (a) front wing; (b) bargeboard; (c) underbody (bottom view); (d) diffuser; (e) underbody (side view).

For the purpose of concluding the qualitative
analysis of the flow, streamlines around the main
components of the vehicle are plotted in Fig. 7. The dual
task of the front wing is best highlighted by Fig. 7a:
apart from the primary function of generating
downforce, it energizes the flow for better feeding the
downstream aerodynamic devices, such as the
underfloor.

Fig. 7b highlights the interaction between
bargeboard and sidepod panel: the former acts like a
huge vortex generator; the latter keeps the
energized flow attached to the sides of the vehicle.
The downforce generated by the underbody is a direct
result of the vehicle front-end design: in fact, the
diffuser is partially fed by the flow bypassing the front
tires (Fig. 7c). As witnessed by Fig. 7d, the entire
underfloor is characterized by highly
three-dimensional streamlines strengthened by the
diffuser activity: it accelerates the flow at its inlet and
creates additional downforce by means of strong
vortices. Fig. 7e gives evidence of the diffuser impact
on the outflow: the streamlines are deflected upwards,
giving rise to a huge wake. Besides the flow coming
from the underfloor, the overall wake consists of
several contributions, including that of rear wing and
rear tires: this explains why a modern F1 car is not
able to generate an adequate amount of downforce in
slipstream.


5. Conclusions 


The aerodynamic performance of a 2017 F1 car
designed by PERRINN was analysed with the
open-source software OpenFOAM. The meshing phase
was particularly tricky because of the sophisticated
geometry of the vehicle. SnappyHexMesh, the
dedicated OpenFOAM tool, despite some challenges
such as the layer addition algorithm, provided a fast
and automatic meshing procedure.

In view of the simulation complexity, a coupled
approach was chosen to avoid numerical instability in
the very first iterations and reduce the number of
iterations to reach convergence. The results of the S-A
RANS incompressible calculation were found to be in
good agreement with the reference data in terms of
drag, downforce, efficiency and front balance. As a
general comment, OpenFOAM is a good instrument
for external aerodynamic investigations, considering
that it is license-free and particularly suitable for
parallel computing.

Regarding the F1 car aerodynamics, it can be
concluded that the flow field involves slender and
blunt bodies interacting with each other. An F1 car
generates downforce in different ways: by inverted
multi-element wings, ground effect of the underbody,
Venturi effect of the diffuser and vorticity
management. The contributions of front and rear wing
to downforce are 26.3% and 27.5%, respectively; most
of the remaining percentage is attributable to the
underbody. Front and rear wheels are the main source
of pressure drag, because they generate a huge wake.
The rear wing is primarily responsible for induced
drag, as a result of axial vortices detaching from the
wingtips. The front wing deserves a comment of its
own: its axial vortices can be used to improve the
performance of the underbody and consequently
generate more downforce.
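
One rough way to compare how efficiently the three main devices produce downforce is the ratio between each component's downforce and its own drag, taken from Tables 3 and 4. The Python sketch below uses this ratio as an assumed efficiency measure; it ignores the interactions between components, and the underbody entry does not include the plank, which has no separate drag entry in Table 4.

```python
# SCz and SCx in m^2, copied from Tables 3 and 4.
components = {
    "Underbody": {"scz": -1.49, "scx": 0.181},
    "Front wing": {"scz": -0.88, "scx": 0.159},
    "Rear wing": {"scz": -0.92, "scx": 0.235},
}


def downforce_to_drag(scz, scx):
    """Downforce per unit drag (downforce is negative SCz)."""
    return -scz / scx


ratios = {name: downforce_to_drag(**data) for name, data in components.items()}
ranking = sorted(ratios, key=ratios.get, reverse=True)

for name in ranking:
    print(f"{name}: {ratios[name]:.2f}")
```
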

It appears that ground effect is the most convenient
way to generate downforce: for this reason the
underbody is more efficient than the front wing, and
the front wing is in turn more efficient than the rear
wing.